2019 talk:Technology outreach & innovation/URL shortener: How something so simple can be so complicated

From Wikimania
Some partial notes on this presentation
  • Extension:ShortUrl was developed starting in May 2011
  • An RFC followed in Nov 2012: https://w.wiki/9
  • The work stalled on an abuse-protection issue in Dec 2016; see https://w.wiki/6vS
  • The Wikidata team at WMDE took over and deployed it in March-April 2019
  • It was then handed back to WMF for maintenance
Some internals of a URL shortener
  • There are three layers: Apache + Varnish cache at the front end; Extension:UrlShortener; and the database
    • The database has one column holding the shortened codes and, in the same row, another column holding the URL each code expands to
  • The characters in a shortened URL often avoid 0 and o because they can be mistaken for one another; 1 also appears to be avoided. The codes can include the other digits, upper- and lower-case English letters, and a $ character
  • Complications: cache; domain whitelist; abuse prevention; rate limit (DDoS protection); TLS/SSL + domain registration; dumps; URL shortening functionality; size limit; normalizing; Twitter; "Easter eggs"
    • Abuse example: hiding a spam or advertising URL inside a shortened URL and posting the shortened URL instead, so that the blacklisted URL evades detection; see the example at https://w.wiki/6vS
    • Easter eggs might be strings found in the shortened URL
  • For more, see https://meta.wikimedia.org/wiki/Wikimedia_URL_Shortener
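
To make the internals above more concrete, here is a minimal Python sketch of the code-to-URL mapping, the restricted character set, and a domain whitelist check. It is an illustration under stated assumptions, not the code of Extension:UrlShortener: the names (ALPHABET, ALLOWED_DOMAINS, Shortener, encode_id, decode_id), the exact set of excluded look-alike characters, the example domains, and the in-memory dictionaries standing in for the database table are all hypothetical.

```python
import string
from urllib.parse import urlsplit

# Look-alike characters left out of the alphabet. This exact set is an
# assumption; the notes only mention 0, o and 1.
AMBIGUOUS = set("0Oo1lI")
ALPHABET = "".join(
    c
    for c in string.digits + string.ascii_uppercase + string.ascii_lowercase
    if c not in AMBIGUOUS
) + "$"

# Hypothetical whitelist of domains that may be shortened.
ALLOWED_DOMAINS = {"meta.wikimedia.org", "www.wikidata.org"}


def encode_id(row_id: int) -> str:
    """Encode a positive row id as a short code (plain base-N encoding)."""
    if row_id < 1:
        raise ValueError("row id must be positive")
    base = len(ALPHABET)
    chars = []
    while row_id > 0:
        row_id, rem = divmod(row_id, base)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))


def decode_id(code: str) -> int:
    """Decode a short code back to a row id; raises ValueError on bad input."""
    base = len(ALPHABET)
    n = 0
    for c in code:
        n = n * base + ALPHABET.index(c)  # index() raises ValueError if absent
    return n


class Shortener:
    """In-memory stand-in for the two-column database table."""

    def __init__(self) -> None:
        self._by_id = {}   # row id -> long URL
        self._by_url = {}  # long URL -> row id, so a repeated URL reuses its code

    def shorten(self, url: str) -> str:
        host = urlsplit(url).hostname
        if host not in ALLOWED_DOMAINS:
            raise ValueError(f"domain not on the whitelist: {host!r}")
        if url not in self._by_url:
            row_id = len(self._by_id) + 1
            self._by_id[row_id] = url
            self._by_url[url] = row_id
        return encode_id(self._by_url[url])

    def expand(self, code: str):
        """Return the long URL for a code, or None if the code is unknown."""
        try:
            return self._by_id.get(decode_id(code))
        except ValueError:
            return None


if __name__ == "__main__":
    s = Shortener()
    code = s.shorten("https://meta.wikimedia.org/wiki/Wikimedia_URL_Shortener")
    print(code, "->", s.expand(code))
```

Deriving the code from the row id keeps codes short and unique without a collision check, and the whitelist check in shorten() is one simple way to stop the shortener from hiding links to arbitrary external sites. Caching, rate limiting, and URL normalizing from the list of complications are left out of this sketch.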