Thursday, 2020-06-25

clarkblooks like my fix is failing now, it's the same error but in the openstack org not x org00:14
fungiin production it was failing on a variety of different namespaces00:15
clarkbI would expect us to process the list of projects in order but maybe we don't00:16
clarkbfungi: does anything about the change I wrote look wrong?00:18
clarkbmaybe we need to respect the link headers because it does some out of order pagination?00:19
clarkbrather than assuming we can iterate one by one until the end00:20
clarkboh maybe urlencoding is a problem00:23
clarkbno I don't think that is it00:25
clarkblooking at the gitea logs from the job it doesn't appear we are looping. we're just doing the first fetch00:28
openstackgerritClark Boylan proposed opendev/system-config master: Deal with gitea pagination of repo lists  https://review.opendev.org/73788200:56
openstackgerritClark Boylan proposed opendev/system-config master: Paginate all the gitea get requests  https://review.opendev.org/73788500:56
clarkbI don't think ^ will fix it but I wanted to make those cleanups anyway00:56
clarkbI'm not having any better ideas right now. Will have to pick this up in the morning. (Also feel free to update if you think you see it)01:01
clarkbactually the first of my changes may be doing the correct thing but not the followup; I was looking at the wrong log file01:04
clarkbthe first is still failing though which makes me wonder if there is a second issue to address01:05
ianwsorry i had to run out this morning but am back ... i'm a bit lost but let me know if i can help01:13
clarkbianw: basically the manage-projects job isn't running at all because we very quickly hit an http 409 error from gitea. The root cause seems to be an addition of pagination listing repos in gitea. We list all the gitea repos then use that list to check if we have to create new repos01:14
clarkbianw: but since we do an incomplete listing we try to create repos that already exist and get the 409 conflict01:14
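A minimal sketch of the failure described above, assuming a hypothetical `fetch_page(page, limit)` callable standing in for the gitea repo-listing request (the real module and its parameters are not shown in this log). Reading only the first page makes every repo beyond it look missing, which is what would lead to create attempts against existing repos and the 409 conflict; looping until an empty page comes back avoids that.

```python
from typing import Callable, List


def list_all_repos(fetch_page: Callable[[int, int], List[str]],
                   limit: int = 50) -> List[str]:
    """Collect repo names page by page until an empty page is returned."""
    repos: List[str] = []
    page = 1
    while True:
        batch = fetch_page(page, limit)
        if not batch:
            break
        repos.extend(batch)
        page += 1
    return repos


def repos_to_create(wanted: List[str], existing: List[str]) -> List[str]:
    """Return the wanted repos that the listing does not contain."""
    existing_set = set(existing)
    return [name for name in wanted if name not in existing_set]


# Hypothetical stand-in for the server: 120 repos, served in slices.
ALL_REPOS = [f"org/repo{i}" for i in range(120)]


def fake_fetch(page: int, limit: int) -> List[str]:
    start = (page - 1) * limit
    return ALL_REPOS[start:start + limit]


first_page_only = fake_fetch(1, 50)
everything = list_all_repos(fake_fetch)

# With an incomplete listing, 70 existing repos look like they need creating.
print(len(repos_to_create(ALL_REPOS, first_page_only)))
# With the full listing, nothing needs creating.
print(len(repos_to_create(ALL_REPOS, everything)))
```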
*** mrunge has quit IRC01:14
clarkbianw: https://review.opendev.org/737882 aims to fix this but is still erroring with the same error implying we aren't listing things properly01:14
*** mrunge has joined #opendev01:14
clarkblooking at the gitea logs for the previous patchset of that change we are doing the looping of requests to get all of the repos01:15
clarkbI'm assuming the bug now is in the internal datastructure representing those lists of repos (which we check against to see if a project already exists)01:15
clarkbbut I just don't see it01:16
clarkband it's getting late and I have cranky kids so hard to think01:16
clarkbianw: to be clear there isn't an immediate emergency. We just can't add or update projects right now01:16
clarkbif you want to poke at it feel free. It's all tested in that stack because the base of the stack sets up the job to run manage projects twice01:17
clarkbfirst time creates all the repos then second pass should noop successfully but it doesn't currently01:17
ianwok cool, i'm fresh eyes on all this so not sure much help but will have a poke01:18
clarkbprobably the next thing is to figure out how to get that ansible library to emit more logging of what the gitea repos it saw were and what repo it tried to create01:23
ianwyou read my mind :)01:23
clarkbcool I'll leave you to it then01:24
clarkbalso its really neat how easy it is to test this stuff01:24
*** DSpider has quit IRC01:31
*** cloudnull has joined #opendev01:42
ianwlooks like ps3 fixed it01:44
clarkboh really?02:01
clarkbmaybe it was a parameter issue then02:01
fungiclarkb: sorry, i had turned in for the evening, i can try to take a look in the morning if you haven't already worked it out02:02
fungiskimming, sounds like maybe you worked it out after all02:03
*** diablo_rojo has quit IRC03:15
*** shtepanie has quit IRC03:28
openstackgerritMerged opendev/grafyaml master: Drop Python 2 support  https://review.opendev.org/73766703:54
*** pmacdonnell has quit IRC03:56
openstackgerritMerged opendev/grafyaml master: Remove query variable refresh deprecation  https://review.opendev.org/73766404:00
*** ykarel|away is now known as ykarel04:21
*** ysandeep|away is now known as ysandeep04:43
openstackgerritIan Wienand proposed opendev/grafyaml master: Add import of json files  https://review.opendev.org/73790005:04
ianwglarkb/fungi: ^ so that gets us to something we talked about, where you can run a local grafana in a container, make your changes via UI and save the json to project-config for review/version control05:05
ianwclarkb even ^ :)05:05
ianwi just need to write the instructions for the grafana side now05:05
*** jaicaa has quit IRC05:22
*** jaicaa has joined #opendev05:23
*** ysandeep is now known as ysandeep|afk05:48
*** cloudnull has quit IRC06:14
*** rpittau|afk is now known as rpittau06:20
*** cloudnull has joined #opendev06:27
*** ysandeep|afk is now known as ysandeep06:44
openstackgerritIan Wienand proposed opendev/system-config master: Grafana container deployment  https://review.opendev.org/73740606:44
openstackgerritAndreas Jaeger proposed openstack/project-config master: Add pep8 jobs to grafyaml  https://review.opendev.org/73791506:55
*** hashar has joined #opendev06:59
openstackgerritIan Wienand proposed openstack/project-config master: Add all python versions to bindep tox testing  https://review.opendev.org/73528407:00
fricklerI haven't looked at that in some time, so don't know when it may have started, but I'm now seeing too large select buttons on https://review.opendev.org/#/admin/projects/openstack/neutron-dynamic-routing,access using firefox, leading to an overlap effect similar to what we had on etherpad. it may be an effect of my local settings, though07:14
*** sgw1 has quit IRC07:22
*** tosky has joined #opendev07:42
*** moppy has quit IRC08:01
*** moppy has joined #opendev08:01
openstackgerritJavier Peña proposed opendev/system-config master: Make the base role and playbook compatible with CentOS  https://review.opendev.org/73704308:14
*** hashar has quit IRC08:16
*** corvus has quit IRC08:17
*** hashar has joined #opendev08:22
*** corvus has joined #opendev08:30
*** ykarel is now known as ykarel|lunch08:39
*** hrw has joined #opendev08:46
hrwmorning08:46
yoctozeptohey infra - got a question about meetpad - does it support recording?08:47
openstackgerritJavier Peña proposed opendev/system-config master: Support CentOS for AFS mirror  https://review.opendev.org/73699609:13
*** sorin-mihai has joined #opendev09:28
*** aannuusshhkkaa has quit IRC09:33
*** DSpider has joined #opendev09:35
*** ysandeep is now known as ysandeep|afk09:39
*** bhagyashris is now known as bhagyashris|afk09:55
*** hashar has quit IRC09:57
*** ykarel|lunch is now known as ykarel09:58
openstackgerritDonny Davis proposed openstack/project-config master: Slowly Scale OE back up  https://review.opendev.org/73794109:59
*** ysandeep|afk is now known as ysandeep10:05
*** rpittau is now known as rpittau|bbl10:20
*** tkajinam has quit IRC10:22
openstackgerritMerged openstack/project-config master: Slowly Scale OE back up  https://review.opendev.org/73794110:27
openstackgerritSorin Sbarnea (zbr) proposed opendev/system-config master: Recognize LP urls for footer bugs  https://review.opendev.org/73796010:29
openstackgerritThierry Carrez proposed openstack/project-config master: Removing missed tripleo-ui references  https://review.opendev.org/73796110:31
frickleryoctozepto: currently not. jitsi does have a recoding component but we haven't deployed that afaik10:35
fricklerrecording10:35
yoctozeptofrickler: ack, thanks10:42
*** bhagyashris|afk is now known as bhagyashris11:00
*** ysandeep is now known as ysandeep|break11:17
*** sorin-mihai has quit IRC11:25
*** ysandeep|break is now known as ysandeep11:47
*** dpawlik6 has quit IRC11:54
openstackgerritMerged openstack/project-config master: Removing missed tripleo-ui references  https://review.opendev.org/73796112:08
openstackgerritAndreas Jaeger proposed openstack/project-config master: Retire networking-onos, openstack-ux, solum-infra-guest-agent: Step 1  https://review.opendev.org/73798712:11
*** dpawlik6 has joined #opendev12:19
*** hashar has joined #opendev12:22
openstackgerritAndreas Jaeger proposed openstack/project-config master: Retire networking-onos, openstack-ux, solum-infra-guest-agent: Step 1  https://review.opendev.org/73798712:26
openstackgerritAndreas Jaeger proposed openstack/project-config master: Finish retirement of networking-onos,openstack-ux,solum-infra-guestagent  https://review.opendev.org/73799212:26
*** rpittau|bbl is now known as rpittau12:42
*** hashar has quit IRC13:16
fungiyoctozepto: i've heard that with the right software you can locally record the browser window13:34
fungithough i don't personally know who's done that13:34
fungiand i expect gpu acceleration makes that complicated to capture13:34
openstackgerritOleksandr Kozachenko proposed openstack/project-config master: Add openstack/tempest-horizon in required project  https://review.opendev.org/73802413:39
openstackgerritOleksandr Kozachenko proposed openstack/project-config master: Add openstack/tempest-horizon in required project  https://review.opendev.org/73802413:44
Open10K8SHi Team13:45
Open10K8SPlease check this PS13:45
Open10K8Shttps://review.opendev.org/73802413:45
Open10K8SWaiting review on other PSs13:45
*** dpawlik6 is now known as dpawlik-213:48
*** dpawlik-2 is now known as danpawlik13:48
*** sgw has joined #opendev13:52
openstackgerritGhanshyam Mann proposed openstack/project-config master: Retire networking-l2gw and networking-l2gw-tempest-plugin  https://review.opendev.org/73803013:57
ttxfungi, clarkb: we should sync on when to approve https://review.opendev.org/#/c/737533/ so that I can watch a few runs and check everything behaves as expected14:06
fungittx: i can approve now if you're around to check results14:09
ttxI'm in a meeting, but my brain is not used at 100%, so yes14:09
openstackgerritGhanshyam Mann proposed openstack/project-config master: Retire networking-l2gw and networking-l2gw-tempest-plugin  https://review.opendev.org/73803014:10
ttxfungi: ^14:12
fungi"ignore_errors: zuul.newrev is defined" seems backwards to me, but that's probably just me not understanding ansible's backwards logic14:12
*** ryohayakawa has quit IRC14:12
fungii thought the idea was to ignore errors from mirroring if there is no zuul.newrev (because it was triggered from something other than a ref-updated event)14:13
ttxzuul.newrev is always defined in the post pipeline, so that's equivalent to ignore_errors = true14:14
ttxthe goal being to ignore mirror failures as long as the reference is up14:14
mnaserfungi: is it ok to +W project-config changes given the current state of manage-projects?14:14
fungibut if you run it outside the post pipeline, say in check, "zuul.newrev is defined" evaluates false14:14
fungiso you're telling it not to ignore errors if run in check?14:15
ttxyes, it basically behaves as it currently does, if tested in check14:15
ttx(so the check test does not really test the new code... but it can't since the job actually runs under different conditions in post pipeline)14:16
openstackgerritGhanshyam Mann proposed openstack/project-config master: Final step for networking-l2gw and networking-l2gw-tempest-plugin retirement  https://review.opendev.org/73804014:17
fungittx: oh, got it, we won't hit this race condition in check/gate pipelines anyway14:18
ttxexactly14:18
funginow if gertty will stop hanging for a moment i can approve :/14:18
ttxfungi: how fast are zuul-jobs deployed once the change merges ? Should I just watch the promote job?14:23
corvusttx: that change will take effect immediately upon merge; so the next run of the job starting after the merge will use it14:23
ttxnoted! Will stand by14:24
openstackgerritMerged openstack/project-config master: Add openstack/tempest-horizon in required project  https://review.opendev.org/73802414:28
*** ysandeep is now known as ysandeep|away14:29
*** mlavalle has joined #opendev14:30
fungiyeah, zuul takes its job configuration from the git state on branches and knows as soon as they merge that it should start using them instead of the prior state14:31
openstackgerritMerged zuul/zuul-jobs master: upload-git-mirror: check after mirror operation  https://review.opendev.org/73753314:32
*** ykarel is now known as ykarel|away14:36
ttxnow waiting for something openstacky to actually merge14:43
fungirackspace just opened tickets letting us know about host outages impacting logstash-worker02 and nl0214:48
fungi#status log logstash-worker02.openstack.org rebooted by provider at 14:46z due to a hypervisor host outage14:53
openstackstatusfungi: finished logging14:53
fungi#status log nl02.openstack.org rebooted by provider at 14:46z due to a hypervisor host outage14:54
openstackstatusfungi: finished logging14:54
ttxlooking good so far14:56
openstackgerritClark Boylan proposed opendev/system-config master: Deal with gitea pagination of repo lists  https://review.opendev.org/73788215:01
openstackgerritClark Boylan proposed opendev/system-config master: Paginate all the gitea get requests  https://review.opendev.org/73788515:01
clarkbfungi: https://review.opendev.org/#/c/737883/ got squashed into 737882 which will allow us to land that stack (I set you as co author on the commit). I think that means you can abandon 73788315:02
clarkbinfra-root ^ Those two changes are ready for review and landing now I guess. The 737882 parent is the one we need for our current problems and 737885 is future proofing15:02
clarkbfor the previous patchsets you can see things appear to work properly starting at https://zuul.opendev.org/t/openstack/build/6c45d6e883454129be037b48e3f714a2/log/job-output.txt#18188 and https://zuul.opendev.org/t/openstack/build/31ccae26537e4ec2835d23e28e8e1d3f/log/job-output.txt#18223 for each of those changes15:05
mordredclarkb: nice15:10
ttxfungi: so the new playbook works well in the nominal case. Now I have to wait for the race condition to happen to see if it really solves it15:11
fungiyeah, that's always the hard part15:11
*** bhagyashris is now known as bhagyashris|afk15:19
AJaegerttx, push three changes and approve them together?15:20
ttxAJaeger: it's hit and miss, depends how fast they enqueue into the post pipeline15:20
AJaegerfun15:20
fungiis there anything we need to check in the wake of the nl02 outage? i suppose keeping all the state in zk mostly shields us from hung/leaked operations when a launcher is suddenly rebooted?15:21
ttxI'll see by tomorrow :) We usually have a couple issues per day15:21
AJaegerlooking forward to hear the results15:22
clarkbfungi: ya should be fine re nl0215:24
*** aannuusshhkkaa has joined #opendev15:57
*** diablo_rojo has joined #opendev16:02
openstackgerritClark Boylan proposed opendev/system-config master: Deal with gitea pagination of repo lists  https://review.opendev.org/73788216:10
openstackgerritClark Boylan proposed opendev/system-config master: Paginate all the gitea get requests  https://review.opendev.org/73788516:10
clarkbmordred: fungi ^ ianw's suggestion was important enough that I thought a new set of ps's would be a good idea16:11
clarkband this generates even more test data (and confidence!)16:11
*** rpittau is now known as rpittau|afk16:11
mordredclarkb: ++16:12
*** hashar has joined #opendev16:14
fungicool16:17
openstackgerritClark Boylan proposed opendev/system-config master: Increase parallelism of gitea project creation  https://review.opendev.org/73806416:43
clarkbmordred: corvus ^ I noticed that those TODOs could be cleaned up while working on the other thing. I'm not sure if I got that quite right and that is even less of an emergency but your input on it since you dealt with the original pass would be good16:43
corvusclarkb: seems legit; i don't recall details about your question in the TODO though.16:48
mordredme either - but also seem legit16:50
corvusi'm afk for a bit16:53
clarkbthanks I've WIP'd it just to be sure the other stuff lands first and we can stabilize before worrying about optimizing16:54
mordredclarkb: ok. I think it's time to try deploying an executor with docker16:59
mordredclarkb: ze01 is already stopped - so I'm going to start with it - sound ok?16:59
clarkbwfm16:59
mordredclarkb: I've disabled ansible - will wait for current playbooks to stop17:00
mordredclarkb: there are a couple of old ansible playbooks runs17:00
clarkbmordred: ok keep in mind the manage-projects backlog will be stopped up against that when the changes land, but we can also run that manually17:01
clarkbmordred: I think job timeouts are doing that17:01
clarkbbut not completely positive of that17:01
mordredclarkb: yeah - need to figure out what's going on there17:02
mordredrunning against ze0117:04
clarkbmordred: we have to manually stop the executor first then run the use docker playbook update?17:07
clarkbwe didn't encode the transition in the playbooks17:07
mordredthat's right17:07
mordredclarkb: actually - we seem to have landed the "disable old service" patch17:08
mordredclarkb: so - the playbook will turn off executor - but will not run docker compose up17:09
clarkbhttps://zuul.opendev.org/t/openstack/build/72dedbb571374ccbbc7c9cc14e10f209/log/job-output.txt#18246 that was the latest pass of the pagination change which was just an update to add a comment in the zuul config17:09
clarkbthat's making me think this fix is incomplete or buggy or racy17:10
clarkbfungi: ianw: ^ I think that must've been why it failed for me last night and I got all confused17:10
clarkbthinking out loud here, it could be a race for listing repos after creating all the repos?17:10
clarkbexcept we seem to have ~25 seconds between runs there17:11
mordredclarkb: oh. bong. zuul-executor is running on ze0117:11
mordredzuul-executor stop did not stop it17:12
clarkbmordred: well it should stop it in about 10 minutes17:12
clarkbit waits for all the ansible to stop running17:13
mordredah17:13
mordrednod17:13
* mordred thought this was one off - but forgot we did that whole big restart17:13
* mordred waits17:13
clarkbmordred: what I do is watch something like `ps -elf | grep zuul | wc -l` and that number should generally trend down17:14
clarkbI pulled the +W off of https://review.opendev.org/#/c/737882/5 and rechecked it in order to generate more data17:15
clarkbif anyone else has ideas for ^ they are more than welcome17:18
clarkbmordred: in particular doing better logging of the tool execution in ansible somehow would be useful17:18
clarkbbut I'm not sure how to expose that in ansible. Maybe just start writing to stdout and ansible captures that or?17:19
clarkbI guess we can hold the nodes too and try to rerun manually and see what happens17:20
* clarkb puts a hold on that change17:20
clarkbalright thats in place17:21
mordredclarkb: no - definitely don't just write to stdout from an ansible module17:23
mordredI think there is a log method now17:23
mordredon the module object17:23
clarkbmordred: cool if I catch one with the hold I can fiddle with finding that and using it in the python module17:23
mordredclarkb: cool17:24
clarkbbut also I think I can run it outside of the ansible context on the held setup17:24
mordredclarkb: ok - ze01 is running in docker17:24
clarkband then do normal python logging/tracebacks/etc17:24
clarkbmordred: now we want to see jobs on ze01 use afs properly?17:24
mordredclarkb: yeah. I've put ze* back in the emergency file and have removed DISABLE-ANSIBLE17:25
mordredso we can watch ze01 for a bit and make sure we're happy with it and everything17:25
mordred#status log ze01 is running via docker now, ze* is still in emergency so we can watch ze0117:26
openstackstatusmordred: finished logging17:26
mnaseris it ok to merge project-config changes that touch manage-projects?17:28
fungimnaser: should be, they just aren't taking effect yet17:28
clarkband the fix is still failing occasionally for unknown reasons17:29
mnaser:(17:29
fricklerinfra-root: something seems wrong, likely related to nl02 reboot, our used nodes dropped and nl01 logs a lot of quota failures17:34
fricklermaybe the reboot left orphaned nodes?17:35
fricklerHttpException: 403: Client Error for url: https://iad.servers.api.rackspacecloud.com/v2/637776/servers, Quota exceeded for ram: Requested 8192, but already used 1441792 of 1536000 ram17:36
fricklerthat looks more like rackspace might have messed up their quota calculation17:36
openstackgerritMerged openstack/project-config master: Add Neutron Arista plugin charm to OpenStack charms  https://review.opendev.org/73779117:42
clarkbya nova can get out of sync17:48
openstackgerritMerged openstack/project-config master: Refresh openstack-ansible grafana dashboards  https://review.opendev.org/73774217:48
* frickler needs to eod, maybe someone can contact them17:48
fungifrickler: a quick server list for iad shows we have 186 instances booted there17:54
fungiso we may have leaked nodes?17:54
openstackgerritMerged openstack/project-config master: Add pep8 jobs to grafyaml  https://review.opendev.org/73791518:06
openstackgerritMerged openstack/project-config master: Add all python versions to bindep tox testing  https://review.opendev.org/73528418:06
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Test multiarch release builds  https://review.opendev.org/73731518:16
openstackgerritClark Boylan proposed opendev/system-config master: Add more logging to gitea project creation  https://review.opendev.org/73808318:23
clarkbmordred: ^ maybe something like that?18:23
*** Open10K8S has quit IRC18:24
*** Open10K8S has joined #opendev18:25
clarkbinterestingly it has failed twice in a row now18:25
clarkbok rerunning the playbook on my held node causes failure18:29
clarkbwhich is good because it likely rules out a race18:29
* clarkb tries the extra logging there now18:29
openstackgerritRafael Folco proposed openstack/diskimage-builder master: Enable py3 on dib release 7  https://review.opendev.org/73642118:31
fungiahh, looks like we have a number of nodes for rax-iad in error and shutoff states18:33
AJaegerinfra-root, https://zuul.opendev.org/t/openstack/build/c4948797c7994937bfa632105d06af93 fails with "No such file or directory: 'kinit': 'kinit'" - this is a promote docs job. Is kinit suddenly missing?18:40
clarkbmordred: ^ we may need to rollback ze01 docker container deployment18:40
clarkbAJaeger: I think it is likely that is related to running ze01 in a docker container18:40
fungiyeah, may have missed installing krb5 in the image18:40
corvusit was supposed to be in the image; we landed a change to add it18:41
corvusbut i agree, immediate resolution should be to stop that ze18:42
corvusi will do that now18:42
AJaegerthanks18:43
corvusi have issued the 'zuul-executor stop' command18:43
fungii have confirmed none of the 21 shutdown or error status instances in rax-iad appear in our `nodepool list` output, so i'll work on manually deleting them with osc18:44
clarkbI think I'm making progress on the gitea thing. in my test case its failing because openstack/telemetry-tempest-plugin isn't in the repo listing18:44
clarkbbut it was supposedly created on the first pass18:44
mordredcorvus: we only added openafs-krb5 - we didn't add krb5-user18:45
clarkband that seems consistent across multiple runs of the playbook on this test setup18:45
clarkbcurling the page for that repo seems to show it18:47
corvusfungi: do they lack the metadata that would tell nodepool to delete them?18:48
fungicorvus: i should have grabbed some samples, however i see there's one in dfw too, so i'll dig into it more closely18:50
corvusmaybe they lost their md due to whatever error happened18:50
fungii was at least able to `openstack server delete` all 21 strays in iad successfully18:50
fungicorvus: the example in dfw does lack the additional nodepool_* properties18:52
fungiits properties field (according to openstack server show) is completely empty18:53
fungilooks like it's from 19 days ago18:53
fungioh, actually, it was created 2018-04-18 but updated 2020-06-06 for some reason18:53
fungiso yeah, this looks like it could be a source of infrequent node leaks18:54
corvusbooo :(18:55
fungianything else i should look at before i delete this one in dfw?19:02
fungiit claims to be over 2 years old, though i have a tough time believing it's been in our openstackjenkins tenant server list for dfw that entire time19:03
clarkbwell thats curious. I checked the cardinality of the gitea repo list and it matches the input list size. But then I convert to a set and now I'm off by one19:05
fungicardinality vs ordinality maybe?19:06
clarkbwell you wouldn't expect duplicates19:08
clarkbsomehow openstack/tempest is listed twice. now to see if that is consistent19:08
clarkb(I'm wondering if this is a page boundary bug in gitea)19:08
fungioh, yeah, maybe they have problems with calculating offsets correctly19:10
clarkbconfirmed that it straddles pages19:14
fungiew19:14
clarkbpage 2 element 50 and page 3 element 119:14
clarkbopenstack/tempest19:14
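The check described above (list length matches the input, but the set is one short) can be sketched as follows. This is only an illustration with made-up list contents; the repo names `openstack/tempest` and `openstack/telemetry-tempest-plugin` come from the discussion, the other two are placeholders, and `diagnose` is a hypothetical helper.

```python
from collections import Counter
from typing import List, Tuple


def diagnose(listing: List[str], expected: List[str]) -> Tuple[List[str], List[str]]:
    """Return (names listed more than once, expected names never listed)."""
    counts = Counter(listing)
    duplicates = sorted(name for name, count in counts.items() if count > 1)
    missing = sorted(set(expected) - set(listing))
    return duplicates, missing


expected = [
    "openstack/nova",                       # placeholder
    "openstack/neutron",                    # placeholder
    "openstack/tempest",
    "openstack/telemetry-tempest-plugin",
]
# A paginated listing where one repo straddles a page boundary and shows up
# twice, pushing another repo out of the results entirely.
listing = [
    "openstack/nova",
    "openstack/tempest",      # last element of one page
    "openstack/tempest",      # first element of the next page
    "openstack/neutron",
]

duplicates, missing = diagnose(listing, expected)
print(len(listing) == len(expected))  # the raw counts agree
print(len(set(listing)))              # but the set is one short
print(duplicates, missing)
```

Comparing lengths alone hides the problem, since the duplicate exactly offsets the missing entry.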
clarkbhowdy19:14
fungii guess the list can be deduplicated as a workaround?19:14
fungior are we also missing entries because of this?19:15
fungii guess for each page we get a duplicate and lose an entry too19:15
clarkbfungi: that doesn't help because the problem is we don't have openstack/telemetry-tempest-plugin in the list and that causes us to try and recreate openstack/telemetry-tempest-plugin and that fails19:15
clarkbno this is the only duplicate19:15
fungiyeah. we could run through the list twice with different prime-numbered page sizes19:15
clarkbwe can also fetch the repo page and see if it exists rather than using the api to list them all19:16
clarkbbut I want to see if I can figure out why this happens in the first place and ya maybe I'll try some different page sizes19:16
clarkbis 17 prime?19:16
fungiyes19:16
clarkband maybe 31?19:17
fungi43 and 47 are the two largest primes under the 50 max19:17
clarkb17 changed the dup19:18
clarkbstill only one duplicate though19:18
clarkbI'm guessing the next step in debugging this is looking at the db and the paging code and figuring out the bug19:18
clarkbseparately we can do a double check and see if https://localhost:3000/org/project is a 404 or a 2XX19:18
clarkband only try to create if it isn't 2XX19:18
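A rough sketch of the double check proposed above: treat the API listing as a first pass, then confirm each apparently missing repo by fetching its page before creating it. The `get_status` callable is a hypothetical stand-in for an HTTP GET of the repo page that returns the status code, and `openstack/brand-new` is an invented name for a repo that really does not exist yet.

```python
from typing import Callable, Iterable, List


def repos_needing_creation(wanted: Iterable[str],
                           listed: Iterable[str],
                           get_status: Callable[[str], int]) -> List[str]:
    """Return repos that are absent from the listing AND whose page is not 2XX."""
    listed_set = set(listed)
    to_create = []
    for repo in wanted:
        if repo in listed_set:
            continue  # the listing already knows about it
        status = get_status(repo)
        if 200 <= status < 300:
            continue  # the listing missed it, but the repo page exists
        to_create.append(repo)
    return to_create


# Hypothetical server state: this repo exists but was dropped from the listing.
FAKE_STATUS = {"openstack/telemetry-tempest-plugin": 200}


def fake_get_status(repo: str) -> int:
    return FAKE_STATUS.get(repo, 404)


wanted = [
    "openstack/tempest",
    "openstack/telemetry-tempest-plugin",
    "openstack/brand-new",
]
listed = ["openstack/tempest"]
result = repos_needing_creation(wanted, listed, fake_get_status)
print(result)
```

The extra request is only made for repos the listing did not return, so the common case costs nothing more than the listing itself.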
fungiyeah, the reason to run through twice with two different prime numbered page sizes is they're guaranteed not to share a period19:19
corvuswe don't have any repo creation happening during this right?19:19
corvusthe data set is supposed to be static during our queries?19:19
clarkbcorvus: correct, we do all the queries upfront19:20
clarkbfungi: ya and we could then combine them all and dedup the result19:20
fungigranted, if the mistaken offset is >1 you basically need n+1 different passes with different page sizes19:21
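One way to read the remark about prime page sizes: page boundaries fall at multiples of the page size, so two page sizes share a boundary only at common multiples. The sketch below just computes the first shared boundary; it does not show that two passes are enough to recover every repo, which is the caveat raised above. `first_shared_boundary` is a hypothetical helper name.

```python
from math import gcd


def first_shared_boundary(size_a: int, size_b: int) -> int:
    """Smallest offset at which both page sizes start a new page (their LCM)."""
    return size_a * size_b // gcd(size_a, size_b)


# Two distinct primes only line up at their product.
print(first_shared_boundary(43, 47))
# Page sizes with a common factor line up much sooner.
print(first_shared_boundary(25, 50))
```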
clarkbexcept in this case we seem to only ever get one dup for some reason19:21
clarkbprobably shouldn't rely on that behavior until we understand it though19:21
fungijust the other day someone asked me where prime numbers are useful in computer science. this would have made a great example19:22
clarkb43, 47, and 50 produce the same duplicate: openstack/tempest19:23
fungineat. is it the first or last entry?19:23
clarkbor wait I might have a bug in duplicate logging19:23
clarkbyup I do19:24
clarkbso ignore that observation19:24
clarkb(there are definitely duplicates and it changes based on page size and it is sometimes tempest)19:25
clarkb43 and 47 don't produce duplicates19:25
clarkband the play succeeds19:25
clarkbhttps://github.com/go-gitea/gitea/pull/1182719:32
clarkbit seems like its querying the db for repos where the owner id matches the provided id19:33
clarkbbut there is an ordered by that is "updated_unix DESC"19:34
clarkbmordred: ^ we're not changing order but maybe if that is a common value the order isn't stable?19:35
clarkbI guess my next step is to check the database19:35
clarkbbut it is lunch time. My hunch is that ordered by isn't stable19:35
clarkband if that is the case we can make a change to something more stable and in the mean time do a secondary fetch for https://localhost:3000/org/project and check that status19:36
fungianother possibility... off-by-one in the 50 max limit? maybe any page size <50 but not ==50 works correctly?19:36
clarkbfungi: setting it to 17 also fails. But maybe there is an off by one relative to any value19:37
clarkbI'll check the db directly after lunch19:37
AJaegerclarkb, fungi, https://review.opendev.org/737791 has added today a new repo - but I think your off-by-one happened already before that, didn't it?19:43
clarkbAJaeger: ya its a bug since we upgraded gitea19:45
clarkbwe should stop adding new projects for now19:45
clarkbhttp://paste.openstack.org/show/795231/ I think that mostly confirms the bug19:54
clarkbInternet says ordered by is not stable on successive requests19:54
clarkbI can't reproduce that via mysql client yet, but I'd be really surprised if there was a different issue19:54
mordredclarkb: yeah - so - if we're ordering only by updated unix - that's only seconds19:55
clarkbmordred: yup exactly. I think the fix in gitea is to ordered by id and updated_unix19:55
clarkbid is a proper key19:55
clarkband should make it stable19:55
mordredyes. that would be the right choice19:55
clarkbI'm going to fiddle with the mysql client to make sure I get it right but I'll make a PR for gitea19:56
clarkband then we can maybe deploy that ourselves and then my fix for the pagination should work?19:56
clarkbor we can hack up our ansible to check if the page for the repo exists as a fallback19:56
*** olaph has quit IRC19:57
mordredyah19:58
mordredclarkb: it should be ordered by updated_unix, id desc i believe19:58
mordred(since you want to order by updated_unix and then by id)19:58
*** hashar has quit IRC19:59
clarkbmordred: ya I think we need two DESC's though? ORDER BY updated_unix DESC , id DESC;20:00
mordredyeah, I think that's right20:01
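To illustrate the hunch about ordering, here is a small simulation, not gitea's code. It assumes a database that is free to return rows with equal `updated_unix` in any order, modeled here by flipping the tie order on every other request. With only `updated_unix` as the sort key the pages overlap and drop rows; adding `id` as a second key makes the order total and the pages consistent.

```python
from typing import Dict, List

Repo = Dict[str, int]


def query(repos: List[Repo], page: int, limit: int, stable: bool) -> List[int]:
    """Return the ids on one page, sorted by updated_unix DESC."""
    if stable:
        # Equivalent of ORDER BY updated_unix DESC, id DESC: a total order.
        ordered = sorted(repos, key=lambda r: (-r["updated_unix"], -r["id"]))
    else:
        # Ties are unspecified, so simulate them changing between requests.
        tie = 1 if page % 2 else -1
        ordered = sorted(repos, key=lambda r: (-r["updated_unix"], tie * r["id"]))
    start = (page - 1) * limit
    return [r["id"] for r in ordered[start:start + limit]]


def list_ids(repos: List[Repo], limit: int, stable: bool) -> List[int]:
    """Walk the pages until an empty one is returned."""
    ids: List[int] = []
    page = 1
    while True:
        batch = query(repos, page, limit, stable)
        if not batch:
            break
        ids.extend(batch)
        page += 1
    return ids


# Four repos that all share one timestamp (second resolution makes this likely).
REPOS = [{"id": i, "updated_unix": 1593000000} for i in range(1, 5)]

unstable = list_ids(REPOS, limit=2, stable=False)
stable = list_ids(REPOS, limit=2, stable=True)
print(unstable)  # same length as the table, but with repeats and gaps
print(stable)    # every id exactly once
```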
clarkbcool I'm going to figure out two different things to push to github. One will be against master I can do a PR for then the other will be 1.12.0 + the fix that we can update our image to and then redo all the testing with that20:03
clarkbthis is easier said than done :)20:03
fungisounds great20:03
fungibut yeah, githubz20:03
mordred++20:04
clarkbhrm actually20:05
clarkbthis won't fix us in production will it?20:05
clarkbassuming that gerrit replication updates that timestamp we'll never have a full listing through that api20:06
fungiis there not an autoincrement index for the projects table?20:07
mordredare they doing this with order by limit?20:07
clarkbfungi: there is20:07
mordredit would be much better to just have this listing be sorted by id20:08
fungiyeah, that20:08
clarkbmordred: https://github.com/go-gitea/gitea/pull/11827/commits/8cc1b15245f06145c267f59146f4cb74c6330a1b first bit of diff there is the order by20:08
clarkbthey are not doing order by limit, its listing everything then chunking later I think20:08
clarkbWhat we can do is pass in opts from the api side to order by id and not updated_unix20:09
clarkband override the default20:09
* clarkb changes commit to do that20:09
mordredyeah20:09
mordredwe don't care about "most recent" for our use case20:10
mordredwe just want the list20:10
fungiand more importantly, the entire list20:14
mordredfungi: yeah - this is one of those cases where it would be really nice to be able to say "hi, please to not paginate"20:16
clarkbmordred: ya as I'm trying to figure out all the places this may be a bug in gitea I'm feeling the same way20:16
clarkbbut I'm giving up on that for now20:17
clarkbbecause my brain is melting20:17
mordredyup20:17
clarkbit kinda makes me think we should solve this differently in ansible20:20
clarkbbut let me push up change to try this version20:21
openstackgerritClark Boylan proposed opendev/system-config master: Deal with gitea pagination of repo lists  https://review.opendev.org/73788220:25
openstackgerritClark Boylan proposed opendev/system-config master: Paginate all the gitea get requests  https://review.opendev.org/73788520:25
openstackgerritClark Boylan proposed opendev/system-config master: Increase parallelism of gitea project creation  https://review.opendev.org/73806420:25
clarkbinfra-root ^ I'm not sure we want to merge it like that just yet, but that should exercise my gitea fix20:26
clarkbI'm filing a gitea bug now20:27
clarkband will do my PR and see if they think it's sufficient or not20:27
clarkbhttps://github.com/go-gitea/gitea/issues/1205620:36
clarkbhttps://github.com/go-gitea/gitea/pull/1205720:38
*** icarusfactor has joined #opendev20:50
openstackgerritClark Boylan proposed opendev/system-config master: Deal with gitea pagination of repo lists  https://review.opendev.org/73810920:51
clarkbinfra-root ^ alternative approach with double checking20:51
*** factor has quit IRC20:52
clarkbI'm going to take a break now. I've got to do the zuul community update tonight so need to prep for that but also don't want to work all afternoon if I'm working tonight :)20:52
clarkbI think that gives us ~2 options to address this and people can feel free to update/fix/etc as necessary20:53
clarkbalso I think we need to sort id ascending so that if we get new repos we don't change the ordering21:03
clarkbbut that shouldn't be as big of an issue because you'll get dups but still have complete data21:03
fungiyeah, agreed, ascending id sort is more robust than descending if new repos are added while iterating (not that we expect that in production)21:07
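
A rough illustration of the ascending versus descending point, assuming plain integer ids and a hypothetical `walk` helper (nothing here is gitea code). A new id is added after the first page has been read:

```python
def walk(ids, descending, limit=2, add_after_first_page=None):
    """Page through ids, re-sorting on each request. Optionally add one new
    id after the first page to mimic a repo created mid-iteration."""
    ids = list(ids)
    seen, page = [], 1
    while True:
        ordered = sorted(ids, reverse=descending)
        chunk = ordered[(page - 1) * limit:page * limit]
        if not chunk:
            return seen
        seen.extend(chunk)
        if page == 1 and add_after_first_page is not None:
            ids.append(add_after_first_page)
        page += 1


desc = walk([1, 2, 3, 4], descending=True, add_after_first_page=5)
asc = walk([1, 2, 3, 4], descending=False, add_after_first_page=5)
print(desc)  # the new id shifts later pages, so one id repeats
print(asc)   # the new id lands at the end, existing pages are undisturbed
```

Both walks still cover every id that existed at the start; only the descending one produces a duplicate.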
clarkband thinking about it more what they really should do is provide a next url that has enough of a seed to reproduce the original list and index into it properly21:09
fungithat probably requires caching some additional state, and then having to decide how long you keep it for21:09
clarkbya, I'm hoping this issue I've filed sparks a discussion on getting it right overall21:10
clarkbfor that reason I'm kinda leaning towards our solution being https://review.opendev.org/738109 for now21:11
clarkbbasically accept the pagination is flawed and work around it21:11
clarkbrather than rely on another likely flawed pagination system21:11
clarkbthough I've just realized that will make creating an initial set of repos very slow. I guess we're about to find out how slow via testing21:15
clarkbfungi: another thought just occurred to me. If we clear out new project additions from projects.yaml we could clear out the gitea management temporarily in order to get gerrit things updated21:21
clarkbfungi: that may be worth doing if this prolongs itself due to the pain of dealing with it21:21
clarkboh also just had a thought. We could check for http 409 and ignore those errors21:22
clarkbinstead of doing the GET beforehand to see if project exists which will slow down initial load out21:22
fungioh, yep, lbyl is biting us basically, we could just eafp21:23
fungiespecially if the only effective error 409 represents is "already exists"21:24
clarkbya it's a conflict which is basically "it's there, you can't do this"21:24
clarkbaiui21:24
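
The lbyl versus eafp idea can be sketched as follows, assuming a hypothetical `FakeGitea` class that only models status codes (it is not the real gitea API or the system-config module), and assuming, as noted above, that 409 only ever means "already exists":

```python
class FakeGitea:
    """Hypothetical stand-in for the gitea API; it only returns status codes."""

    def __init__(self, existing=()):
        self.repos = set(existing)
        self.requests = 0

    def get(self, name):
        self.requests += 1
        return 200 if name in self.repos else 404

    def create(self, name):
        self.requests += 1
        if name in self.repos:
            return 409  # Conflict: the repo is already there
        self.repos.add(name)
        return 201


def ensure_lbyl(api, name):
    """Look before you leap: GET first, create only on 404."""
    if api.get(name) == 404:
        status = api.create(name)
        if status != 201:
            raise RuntimeError(f"create {name} failed: {status}")


def ensure_eafp(api, name):
    """Ask forgiveness: just create and treat 409 as 'already exists'."""
    status = api.create(name)
    if status not in (201, 409):
        raise RuntimeError(f"create {name} failed: {status}")


names = ["x/one", "x/two", "x/three"]

lbyl_api = FakeGitea()
for n in names:
    ensure_lbyl(lbyl_api, n)

eafp_api = FakeGitea()
for n in names:
    ensure_eafp(eafp_api, n)

print(lbyl_api.requests, eafp_api.requests)
```

On an empty server the lbyl path costs two requests per repo and the eafp path one, which is the initial-load slowdown mentioned above. The trade-off is that eafp hides any other cause of a 409.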
clarkbok really popping out for a bit now21:25
clarkbI'll roll back in a bit later and see if anyone has a preference of the ~3 options that have been brainstormed21:25
clarkbthe gitea change I wrote doesn't seem to fix it. https://zuul.opendev.org/t/openstack/build/11bade7c1229425a916a04c505ada62e failed. I think that means we should ask permission or forgiveness. For permission https://review.opendev.org/#/c/738109/21:45
clarkbI'm rechecking though since a single data point is insufficient21:45
clarkbmaybe we recheck that continuously through the end of today and if it continues to work we go with it tomorrow?21:45
clarkbthen I can rebase the other stuff on top of that if we go that route21:45
ianwclarkb: huh, the jury is still out on how to get any logging out of the module?  TypeError: __init__() got an unexpected keyword argument 'log'22:23
ianwi couldn't see anything, other than bunching stuff up to return in the json22:23
clarkbianw: ya that change is broken but I fixed the argument thing and it still didn't work22:23
clarkbianw: I ended up using the built in logging of the module which is really clunky but it worked22:23
clarkbianw: and traced it to bugs in gitea pagination so now I'm thinking something like https://review.opendev.org/#/c/738109/ is our best bet or similar to that but asking forgiveness instead of permission22:24
clarkbI also filed a bug with gitea and pushed a PR that doesn't seem to be working22:24
clarkbianw: I think we recheck 738109 a few times then if people are happy enough with it we can try and land it and get manage projects running22:25
clarkbmanage projects has a fair bit of backlog now though so not sure you want to be on the hook for that overnight (can wait until tomorrow morning)22:25
clarkband I've held some nodes if people want to interact with gitea though I've hacked up the gitea-git-repos role there so may need to restore to known state if expecting that to make sense22:26
ianwso it's not that get_org_repo_list may return duplicates; it's more that it may *also* not return all the projects?22:28
clarkbyes22:29
clarkbbecause pagination orders things by timestamp and collisions would not be sorted stably by mariadb22:29
fungibasically, page offsets seem to start and end at the wrong place22:29
clarkbbut also in production that timestamp can update frequently so you'd lose things in the listing that way as well22:30
ianwand we don't want to just probe for projects all the time and ditch the walk, because that makes it much much slower even in CI when we're starting fresh?22:30
clarkbyes however the 109 change above will do it in CI because we start from nothing so check each repo in that case22:31
clarkband it doesn't seem to be too slow22:31
clarkbbut ya checking after listing is an optimization but we could always check and drop the listing too22:31
ianwyeah, that's my only thought; simplify it by just checking the project directly -- reusing the session it seems like it should be as low overhead as possible22:32
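
A minimal sketch of the "list, then double check" idea, assuming a hypothetical `repos_to_create` helper and an `exists` callable that does the direct per-repo lookup (the actual change in 738109 may be structured differently):

```python
def repos_to_create(wanted, listed, exists):
    """Return the names that really need creating.

    wanted: names from the project config.
    listed: names from the paginated listing, which may be incomplete.
    exists: callable doing a direct check for one repo.
    """
    listed = set(listed)
    missing = []
    for name in wanted:
        if name in listed:
            continue  # the listing already confirms it, no extra request
        if exists(name):
            continue  # the listing missed it, the direct check caught it
        missing.append(name)
    return missing


actual = {"a", "b", "c"}   # what the server really has
listing = ["a", "c"]       # an incomplete listing that dropped "b"
checked = []


def exists(name):
    checked.append(name)
    return name in actual


todo = repos_to_create(["a", "b", "c", "d"], listing, exists)
print(todo, checked)
```

Only names absent from the listing cost a direct check, so a mostly complete listing keeps the extra requests low, while an empty server (as in CI) checks every repo.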
clarkbalso orgs, teams and all that are paginated too and I would be amazed if they don't have similar issues22:33
clarkbI expect we'll end up doing incremental improvements22:33
clarkbstarting with repos to make things work with the current dataset22:33
ianwjust as written also makes sense, and is more easily revertable when it's fixed, so no problem with that either really :)22:33
mordredyeah - ultimately this is a fundamental issue in the api that would surely be good to sort out22:33
clarkbmordred: ya I'm hoping my github issue helps gitea move towards the fixing22:34
clarkbbut ya I think maybe start with 109 once we are confident in it then continue to make incremental improvements from there22:34
*** tosky has quit IRC22:40
*** mlavalle has quit IRC22:58
clarkb109 has succeeded now twice. I've rechecked it again23:02
mordredclarkb, ianw: I left a +2 on 109 but not a +A23:05
mordredbecause - you know - it's EOD here23:05
mordredianw: oh - also - we tried rolling out ze* on docker but were missing krb5-user from the images. zuul is stopped on ze01 and they're all in emergency23:06
mordredwe'll try again tomorrow - but just an fyi23:06
ianwcool, yeah i think my best bet is to not touch anything :)23:06
ianwmordred: not sure if you saw but grafana came together well as a container in https://review.opendev.org/#/c/737406/ and https://review.opendev.org/#/c/737397/523:07
ianwi'm going to have a look at graphite and see if it is as amenable ... that would be two more ticked off the xenial list23:08
mordredianw: that looks great!23:12
mordredianw: I +Ad the first, +2d the second23:12
ianwthanks ... i found https://hub.docker.com/r/graphiteapp/docker-graphite-statsd/ yesterday and it looks very promising as pretty much a drop-in23:13
ianwdevil will be in the details23:13
openstackgerritMonty Taylor proposed opendev/system-config master: Make bindep installs non-interactive  https://review.opendev.org/73812123:19
mordredianw, corvus: ^^ if you got a sec23:20
corvus++ all around23:20
mordredianw: the devil is always in the details23:20
mordredcorvus: awesome - thanks23:20
*** DSpider has quit IRC23:22
openstackgerritMerged opendev/system-config master: Add a grafana/grafyaml image  https://review.opendev.org/73739723:33
*** rchurch has quit IRC23:43
*** rchurch has joined #opendev23:43
*** ryohayakawa has joined #opendev23:56
*** cloudnull has quit IRC23:57
*** cloudnull has joined #opendev23:58