URL Regular Expression & JavaScript Link Shortener

Sooner or later you will probably need to detect URLs in text strings when parsing text in your applications, and shortening them can also be quite handy (as Twitter does, for example).

Matching URLs is not so difficult, but matching them and grouping all the URL fragments is, especially if you want to do it in a single match. So, here is one URL Regular Expression to rule them all.

RegEx

A regular expression (or RegEx) is a pattern language whose purpose is to match particular words, patterns or characters within a text string.

Someone once said that RegEx is like a language of its own. At first it may look like a bunch of random characters, but it’s actually a pretty useful technique for both simple and complex matching of whatever you need within a text string. It is supported by all major programming languages (PHP, Perl, JavaScript, Java, .NET, etc.).
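
To make that a bit more concrete, here is a minimal sketch of a much smaller pattern in JavaScript. It only picks the protocol out of a string; the name `protocolRegex` and the example.com address are placeholders made up for this illustration.

```javascript
// A small pattern: one capturing group for the protocol, followed by "://"
var protocolRegex = /^(http|https|ftp):\/\//i;

// match() returns an array: index 0 is the whole match, index 1 is the first group
var match = 'https://example.com/page'.match(protocolRegex);
var protocol = match ? match[1] : null;

// a string without one of the listed protocols does not match at all
var noMatch = 'mailto:someone'.match(protocolRegex);

console.log(protocol); // https
console.log(noMatch);  // null
```

The round brackets are what create a group, and the long expression further down works the same way, just with more groups.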

URL

We all know what a URL looks like. Here you can see URL anatomy (along with good SEO practices), or just search the web for it.

URL Regular Expression

To match and group all the fragments of a URL, and to cope with the many URL variations, our RegEx is a little bit longer (324 chars), but it captures and groups all the parts nicely (tested in PHP and JavaScript).

/* URL-RegEx 1.2 */

\(?(?:(http|https|ftp):\/\/)?(?:((?:[^\W\s]|\.|-|[:]{1})+)@{1})?((?:www.)?(?:[^\W\s]|\.|-)+[\.][^\W\s]{2,4}|localhost(?=\/)|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})(?::(\d*))?([\/]?[^\s\?]*[\/]{1})*(?:\/?([^\s\n\?\[\]\{\}\#]*(?:(?=\.)){1}|[^\s\n\?\[\]\{\}\.\#]*)?([\.]{1}[^\s\?\#]*)?)?(?:\?{1}([^\s\n\#\[\]]*))?([\#][^\s\n]*)?\)?

It captures 9 groups (plus group zero, which contains the entire URL), so the list below starts with the entire match and the remaining items follow in group order. If a part doesn’t exist in the URL, its group comes back empty.

  1. Entire URL – url being parsed
  2. Protocol – http, https, ftp
  3. Userinfo – username:password
  4. Domain – www.mydomain.com, mydomain.com, 127.0.0.1, localhost…
  5. Port – 80
  6. Path / Folders – /folder/dir/
  7. Page / Filename – e.g. index
  8. File extension – .html, .php…
  9. Query – item=value&item2=value2
  10. Anchor – #home
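
As a rough illustration of the group order above, the match array can be turned into a labeled object. This is only a sketch: the helper name `parseUrl` and the property names are assumptions made for this example, and the regex is the one from this article used without the global flag so that `match()` returns the groups.

```javascript
// URL-RegEx 1.2 from this article (no "g" flag, so match() returns the groups)
var urlRegex = /\(?(?:(http|https|ftp):\/\/)?(?:((?:[^\W\s]|\.|-|[:]{1})+)@{1})?((?:www.)?(?:[^\W\s]|\.|-)+[\.][^\W\s]{2,4}|localhost(?=\/)|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})(?::(\d*))?([\/]?[^\s\?]*[\/]{1})*(?:\/?([^\s\n\?\[\]\{\}\#]*(?:(?=\.)){1}|[^\s\n\?\[\]\{\}\.\#]*)?([\.]{1}[^\s\?\#]*)?)?(?:\?{1}([^\s\n\#\[\]]*))?([\#][^\s\n]*)?\)?/i;

// Hypothetical helper: labels the groups in the order listed above.
// Groups that did not take part in the match are returned as empty strings.
function parseUrl(str) {
    var m = str.match(urlRegex);
    if (!m) {
        return null;
    }
    return {
        url:       m[0],
        protocol:  m[1] || '',
        userinfo:  m[2] || '',
        domain:    m[3] || '',
        port:      m[4] || '',
        path:      m[5] || '',
        filename:  m[6] || '',
        extension: m[7] || '',
        query:     m[8] || '',
        anchor:    m[9] || ''
    };
}

var parts = parseUrl('http://www.mydomain.com:80/folder/dir/index.html?item=value&item2=value2#home');
console.log(parts.domain);    // www.mydomain.com
console.log(parts.path);      // /folder/dir/
console.log(parts.filename);  // index
console.log(parts.extension); // .html
```

Note that in JavaScript an unmatched group is `undefined` rather than an empty string, which is why the sketch falls back to `''`.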

You can see it in action, and also test your own URLs:
Test RegEx

There is just one little catch. Brackets () are allowed in URLs, so imagine someone puts a URL inside brackets that are not part of the URL:

Our blog (http://someweblog.com) is awesome, isn't it?

So, to avoid mistaking a surrounding bracket for part of the URL, we capture the brackets before and after the URL as well. All you have to do after matching is check whether both brackets exist and, if so, remove them. Something like this (in JavaScript):

if ( str.charAt(0) == '(' && str.charAt( str.length-1 ) == ')' ) {
    str = str.slice(1,-1);
}
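
The same check can be wrapped in a small reusable function. The sketch below assumes the matched string is passed in as-is; the name `stripWrappingBrackets` is made up for this example.

```javascript
// Hypothetical helper: removes one pair of wrapping brackets,
// but only when both the opening and the closing bracket are present.
function stripWrappingBrackets(str) {
    if (str.charAt(0) === '(' && str.charAt(str.length - 1) === ')') {
        return str.slice(1, -1);
    }
    return str;
}

var wrapped = stripWrappingBrackets('(http://someweblog.com)');
var plain = stripWrappingBrackets('http://someweblog.com');
var halfOpen = stripWrappingBrackets('(http://someweblog.com');

console.log(wrapped);  // http://someweblog.com
console.log(plain);    // http://someweblog.com
console.log(halfOpen); // (http://someweblog.com
```

A lone bracket is left untouched, since it may well belong to the URL or to the surrounding sentence.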

How does it work?

If you really want to know how, or something is unclear, just write below in the comments. I just couldn’t get myself to write this up right now, but I will if someone is interested.

JavaScript link shortener

Now here comes the fun part: once we’ve matched a URL, we can do whatever we like with it. For example, you no longer have to worry about long links messing up your text. Here is a Twitter-like way of doing it:

var shortenUrl = function(url,protocol,userinfo,host,port,path,filename,ext,query,fragment) {
    // arguments follow the regex groups: the whole match first, then groups 1-9
    // (userinfo is not used here, but it must be listed so the later groups line up)
    // set url length limit
    var limit = 20,
        show_www = false;
    // remove brackets if URL inside them
    if ( url.charAt(0) === '(' && url.charAt( url.length-1 ) === ')' ) {
        url = url.slice(1,-1);
    }
    // add protocol if doesn't exist
    if ( !protocol ) {
        url = 'http://' + url;
    }
    // create new url to show (strip only a leading "www.")
    var domain = show_www ? host : host.replace(/^www\./i, '');
    var visibleUrl = domain + (path || '/') + (filename || '') + (ext || '') + (query ? '?'+query : '') + (fragment || '');
    // shorten URL if bigger than limit
    if ( visibleUrl.length > limit && domain.length < limit ) {
        visibleUrl = visibleUrl.slice(0, limit) + '...';
    }
    // wrap the visible text in a link to the full URL
    // (the markup was lost from this listing; a plain anchor tag is assumed)
    return '<a href="' + url + '">' + visibleUrl + '</a>';
};

// our URL RegEx
var urlRegex = /\(?(?:(http|https|ftp):\/\/)?(?:((?:[^\W\s]|\.|-|[:]{1})+)@{1})?((?:www.)?(?:[^\W\s]|\.|-)+[\.][^\W\s]{2,4}|localhost(?=\/)|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})(?::(\d*))?([\/]?[^\s\?]*[\/]{1})*(?:\/?([^\s\n\?\[\]\{\}\#]*(?:(?=\.)){1}|[^\s\n\?\[\]\{\}\.\#]*)?([\.]{1}[^\s\?\#]*)?)?(?:\?{1}([^\s\n\#\[\]]*))?([\#][^\s\n]*)?\)?/gi;

// some text with link
var text = 'Awesome tune, check it out! http://www.youtube.com/watch?v=hVW9eH_PUi8';

// magic
text = text.replace(urlRegex, shortenUrl);

The result: Awesome tune, check it out! youtube.com/watch?v=...

We hope that you find this technique useful. If you notice any bugs, or have suggestions about the RegEx, please feel free to write below in the comments.

25 comments

  1. Scott Feinstein

    Very nice! I like that you provide a variety of test cases. I notice the regex doesn’t match:

    Without protocol, without slash but with query string params
    www test. test.com?foo=bar

    1. Yeah, haven’t tried that case. I will fix it as soon as possible.
      EDIT: Fixed ;)

    1. Chris

      ^ it removed the bracket. Anyway so it accept the bracket: “[ID]”;

  2. Ha! Many, many thanks! Was hoping to find a reg ex for URL detection, but you gave me *exactly* what I needed it for :)

  3. a couple of cases I found that don’t seem to work as expected:

    thrivehive.com does not match anything (trailing slash is required)
    http://thrivehive.com does not match anything (trailing slash is required)
    http://thrivehive.co.uk returns .uk as the file extension (unless you have a trailing slash)

    1. Works fine for me, try it in “test regex” link, you’ll see they work.

      1. It works in the “test regex” link, but I think that’s because they’re surrounded by empty space. When I pass a url to match as a string it says that “.uk” is the file extension.

        For example, in JavaScript, this works: " http://example.com ".match(pattern), but this doesn’t: "http://example.com".match(pattern)

          1. Awesome, thank you so much. This is an awesome regex, really!

            Have you seen John Gruber’s “An Improved Liberal, Accurate Regex Pattern for Matching URLs”? It’s pretty good. I did a write up comparing yours to his, here: http://bit.ly/15xStKW One thing missing from your regex is support for special characters, like ✪. Would you consider including this?

          2. Also, I don’t know if this is a use case you care about, but links that provide a username and password (e.g. for ftp servers) fail to match. For example, “ftp://user:pass@perlide.org/pub/Makefile.PL”

          3. Nice article :)

            Yeah, I’ll definitely update it to support special characters, I’ve missed that.

            Also, I’ll improve ftp match, thanks for pointing it out.

            And those brackets (), have to check why it didn’t pick them up.

            So expect a new version in a day or two ;)

  4. I tend not to leave a response, but after browsing through a ton of remarks here URL Regular Expression & JavaScript Link Shortener | Some Web Log. I actually do have 2 questions for you if it’s okay. Could it be simply me or does it appear like a few of these remarks come across as if they are left by brain dead people? :-P And, if you are writing on other sites, I would like to follow everything fresh you have to post. Could you post a list of every one of your social sites like your Facebook page, twitter feed, or linkedin profile?

  5. Jaya

    Good One! Just failing this case:
    var string='<a href="http://www.google.com">test</a>';
    I want to get only the url from this string, but it’s giving me the wrong match: http://www.google.com">test

  6. SONU DHAKAR

    this is awesome, and working for me,

    thanks
    Regards –
    sonu dhakar

  7. That has to be one of the most impressive bits of Regex I’ve ever seen. Then again, I’m not all that experienced with Regex so maybe this is actually pretty basic ;). Either way, thanks for sharing!
