Thursday, 2020-09-24

sgwfungi: worked it out, might have not completed all the required steps.  He is added.  BTW, I still had the permission problem from a new machine, new email, same OneID though.  I could do it via CLI00:45
fungimmm... if you can add someone to the group with the ssh cli and not with the webui, that points to your ssh key being associated with a different account than your openid00:49
fungissh key and ssh username i mean00:50
fungisgw: bingo, you have two active accounts with different e-mail addresses, ids 28607 and 32381. the first has a username and the second does not00:57
fungii bet the second is the one you're using via the webui, the first is the one which was in the group00:57
fungilikely the best solution is to move the new openid to the older 28607 account and set 32381 inactive, but you'll be the best judge of which you want to keep01:00
ianwafter all that the templates look unchanged.  however, there will be changes for the next version :)01:13
*** auristor has quit IRC05:25
*** ysandeep|away is now known as ysandeep06:15
AJaegerinfra-root, merged a new repo, but it does not exist, see
AJaegerinfra-root, merged a new repo, but it does not exist, see
AJaeger(now with fixed pasto)07:17
AJaegerinfra-root, and 753816 is *not* in git - check
AJaegerSomehow syncing to gitea looks broken ;(07:18
*** ysandeep is now known as ysandeep|lunch07:59
*** ysandeep|lunch is now known as ysandeep09:21
ianwAJaeger: pretty sure would get it10:47
ianwoh, yeah, it failed @
ianwfatal: []: FAILED! => {10:52
ianw    "changed": true10:52
ianw500 Server Error: Internal Server Error for url: https://localhost:3000/api/v1/orgs/openstack/repos?limit=50&page=210:52
AJaegerianw: I don't see the change mirrored to the gitea farm either, so somehow project-config is not updated in opendev.org10:53
AJaegerianw: so, two problems: 1) repo creation failed; 2) No sync from gerrit to gitea for project-config10:54
ianwAJaeger: ok, the replication queue doesn't look like it's stuck on ; i.e. gerrit thinks it is sending out updates11:00
sshnaidmfungi, clarkb hi, I saw something suspicious in logs page:  maybe it's that cause delays while opening logs page, fyi11:18
ianwAJaeger: has it11:33
ianwit looks like gitea04/05 are out of sync11:34
ianwboth have a lot of git defunct processes11:35
ianwAJaeger: sorry, i'm out of time to keep poking now11:35
ianwinfra-root: if i had to suggest, maybe do the point-release upgrade to gitea with and then re-sync 04 & 0511:36
ianwi guess i should take them out of rotation11:38
ianw#status log took gitea 04 and 05 out of the load-balancer because they are out of sync11:40
openstackstatusianw: finished logging11:40
ianwAJaeger: ^ i think you and me are just lucky to get hashed to 04, but it should be good now11:40
ttxalso (less urgent, just signalling in case there is a relationship) ethercalc seems offline (503)11:40
ianwSep 24 10:13:07 ethercalc02 bash[15741]: /opt/ethercalc/node_modules/redis/index.js:60211:43
ianwSep 24 10:13:07 ethercalc02 bash[15741]:                 throw err;11:43
ianwttx: not related, i restarted it, that was the last error ^11:43
ianw#status log restarted ethercalc service on ethercalc.openstack.org11:43
openstackstatusianw: finished logging11:43
ttxthanks ianw11:43
ianwok, gitea02 is up-to-date with project-config, but not starlingx/meta-starlingx11:46
ianwinfra-root: here's the status ->
ianwgitea02 is up-to-date with project-config, but doesn't have starlingx/starlingx-meta (401 error?).  gitea04 & 05 out of sync with project-config, and no repo too (500 Server Error: Internal Server Error for url: https://localhost:3000/api/v1/orgs/openstack/repos?limit=50&page=2 from both)11:49
ianwi will also turn off gitea0211:50
ianw#status log disabled gitea02 in the load balancer due to missing projects (see previous note about 04 and 05)11:50
openstackstatusianw: finished logging11:50
ianwso 02/04/05 are off in the LB pending further investigation11:50
ianwi think everything should be ok for now; i have to head off but will jump on early tomorrow11:51
AJaegerianw: thanks12:04
AJaegerianw: good night!12:04
sparkp1ughey everyone, I'm a new graduate with intermediate knowledge about linux systems and am quite interested in open source contribution12:06
sparkp1ugI don't know where I should begin, and any guidance would be helpful12:06
roman_gsparkp1ug Hi. Depending on your skills and interests, you could join pretty much any open source project.13:28
roman_gFind one project which is being developed in languages you know, and ask developers on IRC/mailing list/Slack/etc. for low-hanging-fruits for you to get you started.13:29
*** ysandeep is now known as ysandeep|mtg14:02
*** ysandeep|mtg is now known as ysandeep14:25
clarkbcatching up on the gitea stuff. It sounds like 02, 04, 05 didn't create the new starlingx project? and they also are out of sync with git contents?14:56
clarkbthere was OOMing on all three at that time14:57
clarkbwhat I'm confused about is how we ran the gerrit project creation if gitea failed since gitea happens first?14:58
clarkbAJaeger: do you have a link to the change that created the new project?14:58
clarkbI want to work backward from that and see how it broke14:58
clarkbfungi: I just found the logs ansible definitely failed15:03
clarkbianw has suggested we upgrade gitea first, but I think we should get gitea to a happy spot that way we don't have to debug upgrade problems and project creation problems at the same time if the upgrade has a sad15:05
clarkbfungi: since gerrit succeeded at manage projects I think what we want to do is run a modified manage-projects.yaml playbook that only does the gitea bits, then limit it to gitea02,04,0515:06
clarkbfungi: run it then assuming that works now we reboot them to clear out any oom side effects and then tell gerrit to resync them15:06
clarkbfungi: I'll start a root screen on bridge15:07
AJaegerclarkb, 753816 was the change15:07
clarkbI'm worried the max_fail_percentage: 1 we have in manage-projects.yaml is not doing what we think it is doing and that is why gerrit ran after the failures15:11
clarkbbut now that we have the log we can figure that out later15:11
clarkbfungi: does that command look correct to you in the bridge screen?15:13
clarkbI'll run disable-ansible prior to running the ansible command too15:14
clarkb`ansible-playbook /home/zuul/src/` <- is the command for anyone else looking15:15
clarkbthen can run that for 04 and 0515:16
fungioh, sorry, split between multiple discussions this morning. jumping into your screen session now15:17
fungiclarkb: yep, lgtm, let's give it a shot15:18
clarkbok disabling-ansible now then will do that15:19
fungidid a change to push description updates to gitea during those job runs merge yet? i forget where you got to in the meeting i skipped15:19
fungijust wondering if it could be related, or hasn't gone in yet15:19
clarkbfungi: not yet
clarkbok that says it changed 02 /me checks 02 directly15:21
fungiokay, that simplifies the investigation i guess, but also takes away one obvious suspect15:21
clarkbI think the OOMs line up reasonably well15:21
clarkbI'm guessing gitea was unhappy at the time15:21
clarkb exists15:22
clarkbrunning the command above for 04 and 05 now15:22
clarkbthen I'll reboot them then I'll trigger gerrit project sync15:22
fungidoes make me wonder what about that project creation would have pushed them over the edge15:24
clarkb and look good now too15:24
clarkbfungi: I'm guessing something else had increased memory pressure (we aren't under another ddos are we?)15:25
clarkbI'm rebooting 02 04 05 next15:25
fungiyeah, lgtm15:25
clarkbI've reenabled ansible since the ansible runs are all done now15:27
clarkbthere are now ~6.5k tasks replicating to gitea15:29
clarkbonce I'm happy with 02 et al and have them back in the lb I'll quickly check the other 5 servers too15:29
clarkbsparkp1ug: as roman_g mentions finding something that interests you or overlaps with existing knowledge is a good place to start. I'm happy to help with suggestions if you have any interests you're willing to share15:32
clarkbI think instead of max_fail_percentage: 1 we may want any_errors_fatal?15:33
clarkbI wonder if the percentage is calculated against our entire inventory and not just the matching nodes?15:33
clarkbin which case 3 servers out of whatever the total number is may not trip the percentage?15:34
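
The arithmetic behind that hypothesis can be sketched in Python. This does not reproduce Ansible's implementation, and it does not confirm that Ansible uses the whole inventory as the denominator; the `exceeds_max_fail` helper and the inventory size of 400 are hypothetical, picked only to show how the same three failures fall on either side of a 1% threshold. The eight-host count reflects gitea01 through gitea08 as referenced in this log, and the strict "greater than" comparison follows the Ansible documentation's statement that the percentage must be exceeded, not equaled.

```python
def exceeds_max_fail(failed_hosts: int, total_hosts: int, max_fail_percentage: float) -> bool:
    """Return True when the failure percentage is strictly above the threshold."""
    if total_hosts <= 0:
        raise ValueError("total_hosts must be positive")
    return (failed_hosts / total_hosts) * 100 > max_fail_percentage


GITEA_HOSTS = 8                    # gitea01..gitea08, per the log
HYPOTHETICAL_INVENTORY_SIZE = 400  # assumption for illustration only

# Denominator = hosts matched by the play: 3/8 = 37.5%, well above 1%.
against_play_hosts = exceeds_max_fail(3, GITEA_HOSTS, 1)

# Denominator = a much larger inventory: 3/400 = 0.75%, below 1%.
against_inventory = exceeds_max_fail(3, HYPOTHETICAL_INVENTORY_SIZE, 1)

print(against_play_hosts, against_inventory)
```

If the second case were what happened, the play would not abort and later plays could still run, which is the behaviour being puzzled over here.
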
roman_gsparkp1ug if you don't possess development skills, there is still always lots of work to do to 1) improve CI tests, 2) improve CD procedures, 3) improve documentation, 4) work on translations, 5) help others answering questions on project's communication channels, etc.15:35
* clarkb finds breakfast while gerrit replicates15:36
fungiroman_g: sparkp1ug: well, and also if you don't possess development skills, that's no reason not to try your hand at software development anyway. there's no better way to learn, in my opinion15:41
fungi(assuming gaining those skills is an interest you have)15:42
roman_gAbsolutely true.15:42
* roman_g is looking for Go mentor15:42
fungiit's how i learned, or continue to learn really15:42
fungii still consider myself mostly a sysadmin. as a dev i'm a bit of a hack even after all these years15:43
roman_gSame for me. I was sending patches for C code to the mailing list of a project, but I was nowhere near C programming.15:45
roman_gI just can read and type in English...15:45
clarkbAJaeger: does look up to date to you now? (I think it's happy on all 3 hosts but the sync isn't finished yet)15:59
clarkbif anyone is wondering look for MAINT vs UP to see the disabled servers in haproxy show stat output16:08
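
For anyone scripting that check, a minimal Python sketch follows. It assumes the CSV layout that `show stat` emits (a header line starting with `# ` that includes `pxname`, `svname`, and `status` columns). The sample text is made up and trimmed to three columns, and the `example_backend` name in it is hypothetical.

```python
import csv
import io


def servers_by_status(show_stat_csv: str) -> dict:
    """Group server names by their status column, skipping aggregate rows."""
    text = show_stat_csv.lstrip()
    # The header line is prefixed with "# "; drop it so DictReader sees clean names.
    if text.startswith("# "):
        text = text[2:]
    result = {}
    for row in csv.DictReader(io.StringIO(text)):
        # FRONTEND and BACKEND rows are per-proxy aggregates, not individual servers.
        if row["svname"] in ("FRONTEND", "BACKEND"):
            continue
        result.setdefault(row["status"], []).append(row["svname"])
    return result


# Made-up sample, trimmed to the columns used above.
SAMPLE = """
# pxname,svname,status
example_backend,FRONTEND,OPEN
example_backend,gitea01,UP
example_backend,gitea02,MAINT
example_backend,gitea04,MAINT
example_backend,BACKEND,UP
"""

print(servers_by_status(SAMPLE).get("MAINT", []))
```
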
clarkbthe other servers all have recent OOMs too16:09
clarkbI'll rotate through those with restarts and replication once 02 04 and 05 are done16:09
clarkbthat way we can get everything in a happy place before we do the 1.12.4 upgrade16:10
clarkb1k tasks left on 02 04 0516:10
clarkbfungi: re the description updates, we should check the job logs and look at cost to do that timewise. We only run the playbook when projects.yaml changes so its impact should be low but if it is very slow then maybe we reconsider it16:13
*** mlavalle has joined #opendev16:15
clarkb2,4,5 are back in the rotation. I've pulled 1 and 3 out so I can reboot them post OOM and then will replicate and add them back16:19
clarkbgitea's indexer startup timeout is still 30s despite the config change to bump that to 300s16:26
clarkbI'll look into that too I guess16:26
clarkbI think it is a type issue. The value needs to be 300s not 30016:32
fungioh, neat. unit is not implied?16:33
clarkbI guess not16:34
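
The failure mode guessed at here (a bare `300` being rejected so the 30s default stays in effect) can be sketched in Python. This is an illustration only, not gitea's code: `parse_duration_seconds` and its fall-back-to-default behaviour are hypothetical, and it accepts just a handful of unit suffixes.

```python
import re

# Seconds per supported unit suffix (a deliberately small, hypothetical set).
_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}
_PATTERN = re.compile(r"^(\d+(?:\.\d+)?)(ms|s|m|h)$")


def parse_duration_seconds(raw: str, default: float) -> float:
    """Parse values like '300s'; return the default when no unit is given."""
    match = _PATTERN.match(raw.strip())
    if match is None:
        # A unitless value such as '300' does not parse, so the default wins.
        return default
    number, unit = match.groups()
    return float(number) * _UNITS[unit]


print(parse_duration_seconds("300", default=30.0))
print(parse_duration_seconds("300s", default=30.0))
```

A parser that silently falls back like this makes a missing unit look exactly like an unapplied config change, which matches the symptom described above.
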
clarkbI think we should land that before the upgrade too since the upgrade will do restarts16:34
fungii concur16:35
clarkbI think that means our upgrade process should roughly be: reboot and replicate all servers (in progress now), bump startup timeout to 300s properly, upgrade to 1.12.416:35
fungisounds right to me16:37
clarkbthen after the upgrade we can look at landing the description update change16:40
clarkb08 is replicating now and is the last one17:47
clarkbshould be done in ~20 minutes?17:47
fungicool, i'll be around but also in sb meeting17:55
fungiand then i need to start the grill for dinner and knock out some mowing while weather favors it17:55
clarkb08 is back in the rotation now18:04
clarkbI've +2'd the gitea upgrade change. I looked at the diffs quickly too and they are clean from what i can see (expected for minor updates like that)18:04
clarkbstartup timeout change should land in about half an hour18:05
clarkbI'm going to look at how the playbook continued on to review.o.o during manage-projects more now as I think we're stable and just waiting for changes to land on other things18:06
* clarkb does local ansible testing18:06
clarkbthe use of max_fail_percentage does seem to change how error handling happens locally but I can't get it to run a subsequent play (maybe because I need to have many more hosts in my inventory)18:13
clarkbinterestingly that max_fail_percentage code comes from the old cgit ansible puppetry18:13
clarkbnow I'm wondering if it is an ansible version specific thing18:17
AJaegerclarkb: gitea05 looks fine - thanks18:17
AJaegerconfig-core, please review - our lint job was broken ;(18:18
clarkboh it's a bug in our python I think18:20
clarkbare we not setting the appropriate failure flags?18:21
clarkbthis is all very confusing /me reviews AJaeger's link18:21
clarkbAJaeger: I guess that means tox fixed their string interpolation issues18:22
AJaegerclarkb: Ah! I wondered already ...18:24
clarkbAJaeger: ya at some point we had to add the extra % to escape because tox wasn't doing it for us18:25
AJaegerI see... Now it's broken and I only noticed since I looked at the logs ;(18:26
AJaegerdonnyd: can we use openedge for logs again? AFAIU we're ready - config-core, please approve
AJaegerconfig-core, please review and as well18:29
openstackgerritClark Boylan proposed opendev/system-config master: Simplify gitea project creation control flow
clarkbI think ^ may be the next step in better understanding that weird ansible behavior. Basically simplify and go from there18:31
clarkbI'm unable to reproduce the issue locally18:31
clarkbAJaeger: I think donnyd wanted to get better monitoring installed first18:31
donnydclarkb: we can turn it back on, but if it breaks again I still won't know why19:09
donnydit shouldn't break because I pinned the packages on the rgw nodes and took them out of the automation for updates19:09
*** roman_g has joined #opendev19:32
clarkbfungi: ianw I should be around for the next hour and a half or so if we want to land nowish, but then I plan to get out on a bike ride (happy for other people to approve it while I'm doing that too)19:41
clarkbianw: fungi may be another good one to get in after the issues we had previously. I think that may make it easier to understand ansible issues in gitea land19:41
fungiyeah, i'm here, just mowing and getting dinner going19:43
fungiapproved just now19:43
clarkbdonnyd: cool, I'll probably approve that change once gitea is done upgrading and I'm back from my bike ride20:08
*** roman_g has quit IRC20:16
clarkbgitea upgrades should begin in about 4 minutes or so if I'm reading the zuul status page properly20:50
* fungi is around, just flipping burgers (literally)20:56
clarkbgitea01 is about to be upgraded, pulling the new image now20:56
clarkb(I'm tailing the log)20:56
clarkband its been running for longer than the old 30 second startup timeout20:57
clarkb(looks good for that fix)20:57
clarkb looks good as does my navigation to that page20:58
clarkbthrough 04 are done now and they all look good so far21:02
clarkb"The value 1000 (type int) in a string field was converted to '1000' (type string). If this does not look like what you expect, quote the entire value to ensure it does not change." <- I wish it were clearer what types ansible expects where. Anyway we may want to update our ansible to use strings for uid and gids to avoid that warning21:07
clarkbjust something I noticed, everything seems fine21:07
clarkbgitea has been upgraded to 1.12.4 across the whole cluster. I don't see any issues yet21:10
clarkb#status log Upgraded gitea to version 1.12.4 from 1.12.321:10
openstackstatusclarkb: finished logging21:10
clarkbif anyone notices oddities please let us know21:10
clarkbI'll keep an eye out as I get ready for bike ride in case something pops up in the next 10 minutes21:11
fungii'll try to continue keeping one eye on irc21:18
ianwhey, looks like i came in right as everything is done :)21:20
fungiyou, sir, have impeccable timing21:56
ianwfungi: interested if you have any suggestions for the 3081 gitea proxy in ... ssl cert issues.  in the gate, we proxy to localhost that works because we have a self-signed cert.  in production that fails because the cert doesn't cover localhost22:16
ianwseems we could set SSLProxyCheckPeerName off to just ignore it always, or, swizzle the ProxyPass differently for testing v production22:17
fungiianw: we could add an /etc/hosts entry and not refer to localhost?22:17
ianwumm, i think the cert we make still won't cover it22:18
fungii guess you want a non-host-specific reference though22:18
ianwin testing22:18
ianwyou get a 502 error22:18
fungioh, the service listening there cares what http/1.1 hostname header is passed?22:19
ianwso 3081 is the apache proxy we put in for potential user-agent filtering, etc that just proxies through to 3000.  i tried to use it last night and noticed it was giving 502 errors22:21
ianwoh, i guess you're saying "the port 3000 service (gitea) cares about hostname header"22:22
ianwi'm pretty sure port 3000 in the container is a nginx, that is connected up to our letsencrypt certs on disk, so yeah, it gets unhappy22:22
ianwmaybe i'm thinking of something else.  i guess it's all inside gitea actually22:23
fungiyeah, i don't think there's any nginx in there, just some go-based http listener22:26
ianwyeah i'm thinking of graphite :)22:26
*** sparkp1ug has quit IRC22:38
ianwthat switches in the gate, see how that goes22:39
clarkbianw: fungi it might also be nice to land the descriptions update change if people are up to it23:04
clarkbI'll be back at a keyboard in a few23:04
ianwyeah that lgtm23:08
clarkbianw: looks like you saw the simplification on the ansible side for gitea too23:17
ianwyeah, agree with that23:18
clarkbmy best guess is some combo of strategy: free and max_fail_percentage: 1 is why gerrit ran in that failed manage projects run23:18
clarkbok I'm going to approve the openedge logs storage change now23:20
clarkbpromised to do that after my bike ride23:20
clarkbianw: btw your apache change for gitea failed, but I didn't dig into logs yet (there were a number of other issues today :) I was just looking to see if that was necessary to fix gitea things)23:24
ianwclarkb: i've sent a v2 for that which is still running now; basically swizzle the hostname for testing v production23:26
ianwa bit annoying because it passed testing, but then we never switched it in live to notice the production issue23:27
ianwbut i still think it's a good escape hatch23:27
clarkbdonnyd: ^ fyi I'll keep an eye on it too23:28
ianwclarkb: and yep, it just reported and looks like it passed the gate at least :)
ianwwhat exactly does the javascript publishing publish anyway?  clearly nobody has noticed it being old23:42
clarkbianw: in the before docker container times opendev deployed zuul dashboard with that js23:43
clarkbat some point we switched to the bundled js in the zuul packaging23:43
ianwzuul-master-py3-none-any.whl2020-02-20 05:23 11M23:43
clarkband the docker container images use that too23:43
clarkbbasically it's there for people to deploy the js without installing zuul itself aiui23:43
ianwthat doesn't seem to be updating either?23:44
clarkbbut ya clearly it not being updated hasn't been missed, maybe we should consider cleaning it up?23:44
ianwdo we not publish zuul release tarballs to tarballs?23:48
clarkbianw: not sdists I don't think23:49
funginot since the switch to a dedicated zuul tenant i think23:49
ianwhrm, ok, so they only go to pypi?23:49
fungiafaik, yes23:49
clarkband dockerhub via the docker images23:49
ianwwe should probably rm to avoid confusion23:49
clarkbbefore we do that we should ensure this isn't an oversight corvus would prefer we fix23:50
clarkb(but it seems that no one is using that content so maybe cleaning it up is the way to go)23:50
ianwright; it uses which only updates to pypi23:52

