19:03:35 <clarkb> #startmeeting infra
19:03:36 <fungi> i guess we could talk about etherpad lower-casing, but that's not especially urgent
19:03:37 <openstack> Meeting started Tue Jun  2 19:03:35 2020 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:03:38 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:03:40 <openstack> The meeting name has been set to 'infra'
19:03:47 <clarkb> if nothing else we'll record that we had nothing to say
19:04:02 <clarkb> #topic Announcements
19:04:15 <clarkb> This week the PTG is happening so we are a bit distracted
19:04:26 <clarkb> for that reason we'll have a shorter less formal meeting
19:04:30 <clarkb> #topic Open Discussion
19:06:06 <clarkb> Meetpad's room urls are case insensitive due to xmpp limitations
19:06:22 <clarkb> this has created a small amount of confusion with the mapping onto etherpad urls as etherpad urls are case sensitive
19:06:39 <corvus> fungi looked at the database and it looks like people get confused by that all the time
19:06:40 <AJaeger> o/
19:06:49 <clarkb> the workaround we are using is to use lower case urls in both and renaming pads if necessary
19:07:00 <fungi> ahh, yeah, so i did a bit of analysis on what it might look like if we wanted to lower-case all pad names (and presumably set up redirects)
19:07:25 <corvus> so it seems like making etherpad case-insensitive in general would solve this jitsi issue for the future as well as prevent some etherpad-only mistakes
19:07:42 <fungi> we have a bit over 2k pads (out of roughly 60k) which would need to have case-insensitivity collisions resolved
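A minimal sketch of how case-insensitivity collisions like the ones fungi counted could be found, assuming the full list of pad names has already been fetched. The function name `find_case_collisions` and the sample pad names are hypothetical; this is not the script fungi actually ran.

```python
from collections import defaultdict


def find_case_collisions(pad_names):
    """Group pad names that differ only by letter case (hypothetical helper)."""
    groups = defaultdict(set)
    for name in pad_names:
        # Pads whose names lower-case to the same string would collide
        # once the service becomes case-insensitive.
        groups[name.lower()].add(name)
    # Keep only lower-case keys claimed by more than one distinct pad name.
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}
```

Called on `["Infra-PTG", "infra-ptg", "notes"]`, this reports a single collision group under the key `infra-ptg`.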
19:08:14 <fungi> however, etherpad also has a great feature where if anyone connects to a new pad it saves that initial revision with just the intro text
19:08:28 <corvus> i probably am responsible for 5k empty pads :)
19:08:43 <fungi> so i did some comparisons of checksums of all the pad contents and found that if we removed those and also pads which are blank, that leaves more like 500 we'd need to look through
19:09:05 <fungi> still a lot, but not insurmountable
19:09:41 <corvus> brainstorming how to resolve collisions: we could rename one of them to something with a suffix (eg "-case") and ... if it's not too hard to edit via the api... prepend a note at the top saying "if you're looking for $otherpadname it has been moved to $newurl" ?
19:09:46 <fungi> if nothing else, we might take the checksum comparisons as a good opportunity to clean up the db. roughly half of our pads are blank or contain just the intro text
19:10:12 <corvus> fungi: if we cleanup the blank pads, we could cron that weekly or something too
19:10:26 <fungi> yeah, well, intro text only pads for sure
19:10:35 <clarkb> we can delete pads via the api right?
19:10:42 <clarkb> so that bit at least should be straightforward
19:11:01 <fungi> i'm not 100% sure removing blank pads is a good idea, because it's possible someone blanked them in a fit of vandalism, and if we delete them then we can't get them back (excepting from our database backups)
19:11:19 <fungi> but there's fewer of those
19:11:27 <corvus> sorry i meant intro
19:11:34 <fungi> the ones which are all intro text are obviously fair game for sure
19:11:51 <fungi> i also found a surprising number which are intro text plus an abiword error message
19:12:11 <fungi> also we have something like 5 variations of intro text floating around as it's changed over time
19:12:32 <corvus> there's an "appendText()" method, as well as "setText()"... so adding a redirect message seems plausible
19:13:09 <fungi> but basically anything over 20 identical checksums (after stripping leading/trailing whitespace) is trash, i verified the texts manually
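The checksum rule described here can be sketched as follows, assuming pad texts are available as a name-to-text mapping. The log does not say which checksum was used, so SHA-256 is an assumption, and `checksum` and `boilerplate_checksums` are hypothetical names.

```python
import hashlib
from collections import Counter


def checksum(text):
    """Checksum of pad text after stripping leading/trailing whitespace.

    SHA-256 is an assumption; the discussion does not name the algorithm.
    """
    return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()


def boilerplate_checksums(pad_texts, threshold=20):
    """Return checksums shared by more than `threshold` pads."""
    counts = Counter(checksum(text) for text in pad_texts.values())
    return {digest for digest, count in counts.items() if count > threshold}
```

Pads whose checksum lands in the returned set are candidates for cleanup. The threshold of 20 comes from the discussion, and the matching texts were still verified manually there.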
19:13:10 <corvus> i'm not sure how that deals with formatting (perhaps the setHTML() method would be necessary?) but it's at least something to look into
19:14:11 <corvus> any other brainstorms about how to resolve the conflicts?
19:14:37 <fungi> as for the actually empty pads, i could probably do a bit of analysis on revision count. most of them probably just deleted the intro text and that was it
19:15:24 <fungi> i mean, chances are a lot of the remaining 500 name collisions are also trash, i just haven't had time to take a look
19:16:25 <corvus> yeah, but if more than like 20 of them are real, it may be easier to automate the whole thing
19:16:43 <corvus> (after all, if we automate it, and nobody notices, it's no big deal :)
19:17:17 <clarkb> corvus: another idea: if they are the result of people mistakenly using the wrong case, we could merge them somehow and keep the lower case version going forward
19:17:34 <clarkb> if they are distinct then your idea seems sane
19:17:47 <corvus> like concat them?  yeah
19:18:29 <corvus> oh, i guess fungi implied an option that i didn't quite pick up on too --
19:18:40 <clarkb> ya concat is probably simplest
19:18:44 <corvus> rename one pad, and add an .htaccess entry for that one
19:18:50 <corvus> (that does a redir)
19:19:25 <corvus> that would work for people visiting etherpad.o.o directly, but wouldn't address confusion for folks arriving via meetpad
19:20:00 <clarkb> right I think we want to force etherpad to do lower case too?
19:20:13 <clarkb> at least that was what I was assuming we wanted; then it would avoid confusion there and mismatched behavior with jitsi
19:20:22 <fungi> yeah, i wondered if we should make etherpad just redirect to lower-case padnames (if that's possible)
19:20:24 <AJaeger> can we ensure that future new pads are all lowercase?
19:20:32 <corvus> yeah, i think in all cases, we have etherpad redirect to lower case
19:21:01 <fungi> that avoids people creating new problem pads
19:21:11 <clarkb> we can enforce that with apache
19:21:16 <clarkb> (I think)
19:21:49 <corvus> then the question is for conflicts, do we a) move and add a note to the pad (optional: add a specific redirect for the moved pad); b) concat.
19:22:01 <corvus> clarkb: yeah, would be a simple mod_rewrite redirect
19:22:50 <corvus> fungi: do you have a list of collisions?
19:22:57 <fungi> yep!
19:23:16 <fungi> i didn't post it publicly since i don't know if anyone was relying on some random pad names to not be discoverable
19:23:50 <corvus> if we remain concerned about that, that may eliminate the idea of having a .htaccess list for specific pad name redirects
19:24:02 <fungi> it's also just a python script i can rerun to regenerate, but takes around an hour due to the number of queries
19:24:22 <corvus> fungi: ~fungi/collisions.yaml ?
19:24:53 <fungi> checking, but if that's got 504 entries then yes
19:25:23 <fungi> yeah, that looks like it
19:25:44 <fungi> that's the collisions which would remain if we cleaned up empty and intro text pads
19:25:49 <corvus> ah nice, there's some linkfarm spam there
19:28:41 <fungi> anyway, just wanted to strike up the discussion when it wasn't a weekend
19:29:07 <fungi> scripting stuff against the etherpad rest api is not hard, and it's well-documented
19:29:27 <fungi> so we could certainly consider periodic cleanup by checksum, for example
19:29:31 <corvus> spot checking these, i feel pretty confident that only one of the two of each of these is going to be important
19:29:52 <corvus> so far, they're either both linkspam, or one was clearly "the wrong one"
19:29:56 <fungi> that's my suspicion as well, they just weren't going to be as easy to mechanically identify
19:30:14 <clarkb> cool in that case we should be able to delete the bad one, then rename if necessary, and set up redirects in apache?
19:30:16 <corvus> we could use a simple rubric to determine the "better" of the two to rename
19:30:24 <fungi> so my hope is that we wouldn't need any fancy per-pad redirects or breadcrumbs
19:30:37 <corvus> well, if we go through all 500 -- do we want to?
19:31:19 <corvus> i guess if we got a bunch of folks doing it, we could probably knock it out pretty quickly
19:31:20 <fungi> i suppose we could do it in batches (the rest api would still let us get to the redirected originals)
19:31:46 <fungi> so we could still set up the mass redirect and renaming while we worked through the collisions
19:32:10 <fungi> but we'd presumably want to go through the cleanup first, at least before bulk moving
19:32:46 <corvus> here's what i'm thinking: if we want to delete one of the pads (or rename it to a non-public name), we'll need to go through manually and identify which one to keep.  but if, instead, we went with one of the options above (concat or rename with link) we could use a simple length rubric to 'guess' which is the best one, so we can make that the one that people land on by default.  essentially,
19:32:46 <corvus> rename that to be the lowercase one if it isn't already.
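One possible reading of the "simple length rubric", sketched under assumptions: the longer pad (after stripping whitespace) is guessed to be the real one, and ties go to the name that is already lower-case. The function name `pick_pad_to_keep` and the tie-break rule are hypothetical and not something agreed in the meeting.

```python
def pick_pad_to_keep(texts_by_name):
    """Guess which colliding pad should own the lower-case name.

    `texts_by_name` maps each colliding pad name to its text.
    """
    def score(name):
        # Longer content wins; on equal length prefer the name that is
        # already lower-case (an assumed tie-break).
        return (len(texts_by_name[name].strip()), name == name.lower())

    return max(sorted(texts_by_name), key=score)
```

The chosen pad would then be renamed to the lower-case name if it does not already have it, with the other pad concatenated or renamed with a link, as discussed above.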
19:33:29 <clarkb> for link spam we might be able to identify those based on content? eg just a list of urls?
19:33:46 <corvus> probably so?  that might prune it a bit more
19:33:58 <fungi> plan sounds reasonable, also yes spotting pads which are just lists of urls may also be scriptable
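A rough sketch of how "just a list of urls" might be detected from pad text. The regular expression, the 0.9 cutoff, and the name `looks_like_url_list` are all assumptions for illustration, and a heuristic like this would still need spot checking.

```python
import re

# A line counts as a URL line if it holds only one http(s) URL,
# optionally preceded by a "-" or "*" bullet (an assumed convention).
URL_LINE = re.compile(r"^\s*(?:[-*]\s*)?https?://\S+\s*$")


def looks_like_url_list(text, min_fraction=0.9):
    """Return True if nearly every non-blank line is a bare URL."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return False
    url_lines = sum(1 for line in lines if URL_LINE.match(line))
    return url_lines / len(lines) >= min_fraction
```

Pads flagged this way could be reviewed as a batch rather than one at a time.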
19:34:20 <corvus> also, some of the linkspam may actually have identical content
19:35:59 <fungi> yeah, some does, you can find the checksum analyses in checksums.yaml
19:36:16 <corvus> okay, maybe we can putz with this the rest of this week and see if we can prune the list a bit, then send an email out next week with a suggested plan
19:36:43 <fungi> wfm
19:37:04 <fungi> i at least wanted to be sure this was something we felt we ought to do
19:37:31 <clarkb> ya it seems doable
19:37:44 <clarkb> and considering people were having collision issues previously, it seems like a good idea, meetpad or not
19:38:38 <fungi> a bunch of the url-heavy examples i'm looking at don't seem to be linkfarms for search engine purposes
19:38:59 <fungi> they instead seem to be mazes of link obfuscation and url proxies
19:40:00 <corvus> yeah, i was a little ambivalent about doing it just to "fix" meetpad, but i'm increasingly convinced it's a Good Idea
19:40:24 <corvus> fungi: yeah, there's some interesting stuff in there, the purpose of which i don't fully understand
19:41:48 <corvus> aww, i just found someone's gerrit http password :/
19:49:07 <corvus> clarkb: i think we may be out of topics :)
19:49:17 <clarkb> agreed
19:49:27 <clarkb> #endmeeting