19:03:35 #startmeeting infra
19:03:36 i guess we could talk about etherpad lower-casing, but that's not especially urgent
19:03:37 Meeting started Tue Jun 2 19:03:35 2020 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:03:38 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:03:40 The meeting name has been set to 'infra'
19:03:47 if nothing else we'll record that we had nothing to say
19:04:02 #topic Announcements
19:04:15 This week the PTG is happening so we are a bit distracted
19:04:26 for that reason we'll have a shorter less formal meeting
19:04:30 #topic Open Discussion
19:06:06 Meetpad's room urls are case insensitive due to xmpp limitations
19:06:22 this has created a small amount of confusion with the mapping onto etherpad urls as etherpad urls are case sensitive
19:06:39 fungi looked at the database and it looks like people get confused by that all the time
19:06:40 o/
19:06:49 the workaround we are using is to use lower case urls in both and rename pads if necessary
19:07:00 ahh, yeah, so i did a bit of analysis on what it might look like if we wanted to lower-case all pad names (and presumably set up redirects)
19:07:25 so it seems like making etherpad case-insensitive in general would solve this jitsi issue for the future as well as prevent some etherpad-only mistakes
19:07:42 we have a bit over 2k pads (out of roughly 60k) which would need to have case-insensitivity collisions resolved
19:08:14 however, etherpad also has a great feature where if anyone connects to a new pad it saves that initial revision with just the intro text
19:08:28 i probably am responsible for 5k empty pads :)
19:08:43 so i did some comparisons of checksums of all the pad contents and found that if we removed those and also pads which are blank, that leaves more like 500 we'd need to look through
19:09:05 still a lot, but not insurmountable
19:09:41 brainstorming how to resolve collisions: we could rename one of them to something with a suffix (eg "-case") and ... if it's not too hard to edit via the api... prepend a note at the top saying "if you're looking for $otherpadname it has been moved to $newurl" ?
19:09:46 if nothing else, we might take the checksum comparisons as a good opportunity to clean up the db. roughly half of our pads are blank or contain just the intro text
19:10:12 fungi: if we cleanup the blank pads, we could cron that weekly or something too
19:10:26 yeah, well, intro text only pads for sure
19:10:35 we can delete pads via the api right?
19:10:42 so that bit at least should be straightforward
19:11:01 i'm not 100% sure removing blank pads is a good idea, because it's possible someone blanked them in a fit of vandalism, and if we delete them then we can't get them back (excepting from our database backups)
19:11:19 but there's fewer of those
19:11:27 sorry i meant intro
19:11:34 the ones which are all intro text are obviously fair game for sure
19:11:51 i also found a surprising number which are intro text plus an abiword error message
19:12:11 also we have something like 5 variations of intro text floating around as it's changed over time
19:12:32 there's an "appendText()" method, as well as "setText()"... so adding a redirect message seems plausible
19:13:09 but basically anything over 20 identical checksums (after stripping leading/trailing whitespace) is trash, i verified the texts manually
19:13:10 i'm not sure how that deals with formatting (perhaps the setHTML() method would be necessary?) but it's at least something to look into
19:14:11 any other brainstorms about how to resolve the conflicts?
19:14:37 as for the actually empty pads, i could probably do a bit of analysis on revision count. most of them probably just deleted the intro text and that was it
19:15:24 i mean, chances are a lot of the remaining 500 name collisions are also trash, i just haven't had time to take a look
19:16:25 yeah, but if more than like 20 of them are real, it may be easier to automate the whole thing
19:16:43 (after all, if we automate it, and nobody notices, it's no big deal :)
19:17:17 corvus: other ideas if they are the result of people mistakenly using case improperly we could merge them somehow and keep the lower case version going forward
19:17:34 if they are distinct then your idea seems sane
19:17:47 like concat them? yeah
19:18:29 oh, i guess fungi implied an option that i didn't quite pick up on too --
19:18:40 ya concat is probably simplest
19:18:44 rename one pad, and add an .htaccess entry for that one
19:18:50 (that does a redir)
19:19:25 that would work for people visiting etherpad.o.o directly, but wouldn't address confusion for folks arriving via meetpad
19:20:00 right I think we want to force etherpad to do lower case too?
19:20:13 at least that was what I was assuming we wanted then it would avoid confusion there and mismatched behavior with jitsi
19:20:22 yeah, i wondered if we should make etherpad just redirect to lower-case padnames (if that's possible)
19:20:24 can we ensure that future new pads are all lowercase?
19:20:32 yeah, i think in all cases, we have etherpad redirect to lower case
19:21:01 that avoids people creating new problem pads
19:21:11 we can enforce that with apache
19:21:16 (I think)
19:21:49 then the question is for conflicts, do we a) move and add a note to the pad (optional: add a specific redirect for the moved pad); b) concat.
19:22:01 clarkb: yeah, would be a simple mod_rewrite redirect
19:22:50 fungi: do you have a list of collisions?
19:22:57 yep!
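The script that produced the collision list was not shown in the meeting. As a rough sketch of the idea only, assuming the pad names have already been fetched into a plain Python list, grouping names by their lower-cased form could look like the following. The function name find_case_collisions is hypothetical and not taken from the actual script.

```python
from collections import defaultdict


def find_case_collisions(pad_names):
    """Group pad names that differ only by letter case.

    Hypothetical helper for illustration. Returns a dict mapping the
    lower-cased name to the sorted list of distinct original names,
    keeping only groups with more than one distinct spelling.
    """
    groups = defaultdict(set)
    for name in pad_names:
        groups[name.lower()].add(name)
    return {
        lowered: sorted(names)
        for lowered, names in groups.items()
        if len(names) > 1
    }


if __name__ == "__main__":
    # Made-up pad names, purely for demonstration.
    sample = ["Infra-Meeting", "infra-meeting", "ptg-notes"]
    for lowered, names in find_case_collisions(sample).items():
        print(lowered, "<-", names)
```

Using str.lower() assumes pad names are effectively ASCII; if that does not hold, str.casefold() may be a closer match for case-insensitive comparison.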
19:23:16 i didn't post it publicly since i don't know if anyone was relying on some random pad names to not be discoverable
19:23:50 if we remain concerned about that, that may eliminate the idea of having a .htaccess list for specific pad name redirects
19:24:02 it's also just a python script i can rerun to regenerate, but takes around an hour due to the number of queries
19:24:22 fungi: ~fungi/collisions.yaml ?
19:24:53 checking, but if that's got 504 entries then yes
19:25:23 yeah, that looks like it
19:25:44 that's the collisions which would remain if we cleaned up empty and intro text pads
19:25:49 ah nice, there's some linkfarm spam there
19:28:41 anyway, just wanted to strike up the discussion when it wasn't a weekend
19:29:07 scripting stuff against the etherpad rest api is not hard, and it's well-documented
19:29:27 so we could certainly consider periodic cleanup by checksum, for example
19:29:31 spot checking these, i feel pretty confident that only one of the two of each of these is going to be important
19:29:52 so far, they're either both linkspam, or one was clearly "the wrong one"
19:29:56 that's my suspicion as well, they just weren't going to be as easy to mechanically identify
19:30:14 cool in that case we should be able to delete the bad one, then rename if necessary, and set up redirects in apache?
19:30:16 we could use a simple rubric to determine the "better" of the two to rename
19:30:24 so my hope is that we wouldn't need any fancy per-pad redirects or breadcrumbs
19:30:37 well, if we go through all 500 -- do we want to?
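The checksum-based cleanup discussed above could be sketched roughly as follows. This is an illustration under assumptions, not the analysis script from the meeting: it assumes pad contents are already loaded into a dict of name to text, picks SHA-256 as the checksum (the meeting did not say which one was used), and borrows the "more than 20 identical checksums after stripping whitespace" rule of thumb from the discussion. The names content_checksum and boilerplate_pads are hypothetical.

```python
import hashlib
from collections import Counter


def content_checksum(text):
    """Checksum of pad text with leading/trailing whitespace stripped.

    SHA-256 is an assumption; any stable hash would do for grouping.
    """
    return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()


def boilerplate_pads(pad_texts, threshold=20):
    """Return sorted names of pads whose content is shared by many pads.

    pad_texts: dict mapping pad name -> pad text (hypothetical input).
    A pad is flagged when more than `threshold` pads have the same
    checksum, which in practice tends to mean intro-text-only pads.
    """
    sums = {name: content_checksum(text) for name, text in pad_texts.items()}
    counts = Counter(sums.values())
    return sorted(name for name, s in sums.items() if counts[s] > threshold)


if __name__ == "__main__":
    # Made-up data: 21 untouched pads plus one with real notes.
    pads = {f"pad{i}": "  Welcome to Etherpad!\n" for i in range(21)}
    pads["ptg-notes"] = "agenda for tuesday"
    print(len(boilerplate_pads(pads)), "pads look like boilerplate")
```

Flagged pads would still want a manual look before deletion, since a deleted pad can only be recovered from database backups.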
19:31:19 i guess if we got a bunch of folks doing it, we could probably knock it out pretty quickly
19:31:20 i suppose we could do it in batches (the rest api would still let us get to the redirected originals)
19:31:46 so we could still set up the mass redirect and renaming while we worked through the collisions
19:32:10 but we'd presumably want to go through the cleanup first, at least before bulk moving
19:32:46 here's what i'm thinking: if we want to delete one of the pads (or rename it to a non-public name), we'll need to go through manually and identify which one to keep. but if, instead, we went with one of the options above (concat or rename with link) we could use a simple length rubric to 'guess' which is the best one, so we can make that the one that people land on by default. essentially,
19:32:46 rename that to be the lowercase one if it isn't already.
19:33:29 for link spam we might be able to identify those based on content? eg just a list of urls?
19:33:46 probably so? that might prune it a bit more
19:33:58 plan sounds reasonable, also yes spotting pads which are just lists of urls may also be scriptable
19:34:20 also, some of the linkspam may actually have identical content
19:35:59 yeah, some does, you can find the checksum analyses in checksums.yaml
19:36:16 okay, maybe we can putz with this the rest of this week and see if we can prune the list a bit, then send an email out next week with a suggested plan
19:36:43 wfm
19:37:04 i at least wanted to be sure this was something we felt we ought to do
19:37:31 ya it seems doable
19:37:44 and considering people were having collision issues previously seems like a good idea meetpad or not
19:38:38 a bunch of the url-heavy examples i'm looking at don't seem to be linkfarms for search engine purposes
19:38:59 they instead seem to be mazes of link obfuscation and url proxies
19:40:00 yeah, i was a little ambivalent about doing it just to "fix" meetpad, but i'm increasingly convinced it's a Good Idea
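The two scriptable ideas above (spotting pads that are just lists of urls, and a simple length rubric for guessing which colliding pad to keep) might be sketched like this. Everything here is an assumption for illustration: the 0.8 url-line ratio, the tie-breaking order, and the names looks_like_link_list and pick_pad_to_keep are all hypothetical. The meeting only mentioned a "simple length rubric"; ranking link lists last and preferring the already-lowercase name on a tie are additions of this sketch.

```python
import re

# Loose pattern for http(s) urls; an assumption, not a full URL parser.
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)


def looks_like_link_list(text, min_ratio=0.8):
    """Guess whether a pad is mostly a list of urls.

    True when at least `min_ratio` of the non-empty lines contain a url.
    Empty pads are never treated as link lists.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False
    url_lines = sum(1 for ln in lines if URL_RE.search(ln))
    return url_lines / len(lines) >= min_ratio


def pick_pad_to_keep(candidates):
    """Pick which pad of a collision group people should land on.

    candidates: dict mapping pad name -> pad text (hypothetical input).
    Ranking: pads that do not look like link lists first, then longer
    stripped text, then a name that is already lowercase.
    """
    def score(item):
        name, text = item
        return (
            not looks_like_link_list(text),
            len(text.strip()),
            name == name.lower(),
        )

    return max(candidates.items(), key=score)[0]


if __name__ == "__main__":
    # Made-up collision pair for demonstration.
    pair = {
        "Meeting-Notes": "agenda\n- item one\n- item two\n",
        "meeting-notes": "Welcome",
    }
    print("keep:", pick_pad_to_keep(pair))
```

The chosen pad would then be renamed to the lowercase name if it is not already there. Length alone can misjudge cases such as a long pad of pasted junk, so the result is better treated as a default guess than a decision.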
19:40:24 fungi: yeah, there's some interesting stuff in there, the purpose of which i don't fully understand
19:41:48 aww, i just found someone's gerrit http password :/
19:49:07 clarkb: i think we may be out of topics :)
19:49:17 agreed
19:49:27 #endmeeting