Detecting URLs in text is something you will probably need sooner or later when parsing text in your applications, and shortening them can be quite handy too (as Twitter does, for example).

Matching URLs is not so difficult, but matching them and grouping all URL fragments is, especially if you want to do it in a single match. So, here is one URL Regular Expression to rule them all.

RegEx

A regular expression (or RegEx) is a small pattern language whose purpose is to match particular words, patterns or characters within a text string.

Someone once said that RegEx is like a language of its own. At first it may look like a bunch of random characters, but it is actually a pretty useful technique for both simple and complex matching of whatever you need within a text string. It is supported by all major programming languages (PHP, Perl, JavaScript, Java, .NET, etc.).
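To make the idea concrete before we get to URLs, here is a minimal sketch of simple matching and grouping in JavaScript. The sample sentence and the variable names are made up for illustration.

```javascript
// Made-up sample text, used only for illustration.
const sample = 'Order 66 shipped on 2024-05-17 to Berlin.';

// Three capture groups: year, month, day.
const datePattern = /(\d{4})-(\d{2})-(\d{2})/;
const found = sample.match(datePattern);

// found[0] holds the whole match, found[1] to found[3] hold the groups.
console.log(found[0], found[1], found[2], found[3]);

// test() only answers yes or no; "i" makes the match case-insensitive.
const hasWord = /\bshipped\b/i.test(sample);
console.log(hasWord);
```

The URL RegEx further down works the same way, only with more groups.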

URL

We all know what a URL looks like. Here you can see the URL anatomy (along with good SEO practices), or just search the web for it.
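For a quick look at those parts in code, here is a small sketch that uses the URL class built into browsers and Node.js, not the RegEx from this article. The sample address is made up so that every part is present.

```javascript
// Made-up sample address, chosen to contain every part of a URL.
const sampleUrl = new URL('http://user:secret@www.mydomain.com:8080/folder/dir/index.html?item=value&item2=value2#home');

const anatomy = {
    protocol: sampleUrl.protocol, // scheme, including the trailing colon
    username: sampleUrl.username, // first half of the userinfo
    password: sampleUrl.password, // second half of the userinfo
    hostname: sampleUrl.hostname, // domain without the port
    port: sampleUrl.port,         // empty string when the scheme's default port is used
    pathname: sampleUrl.pathname, // folders plus file name
    search: sampleUrl.search,     // query, including the leading "?"
    hash: sampleUrl.hash          // anchor, including the leading "#"
};

console.log(anatomy);
```

Keep in mind that the built-in parser expects a single, complete URL with a protocol (it throws on a bare mydomain.com), so it cannot find URLs inside running text. That is the job of the RegEx.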

URL Regular Expression

To match and group every fragment of a URL across its many variations, our RegEx is a bit long (324 chars), but it captures and groups all parts nicely (tested in PHP and JavaScript).

/* URL-RegEx 1.2 */

\(?(?:(http|https|ftp):\/\/)?(?:((?:[^\W\s]|\.|-|[:]{1})+)@{1})?((?:www.)?(?:[^\W\s]|\.|-)+[\.][^\W\s]{2,4}|localhost(?=\/)|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})(?::(\d*))?([\/]?[^\s\?]*[\/]{1})*(?:\/?([^\s\n\?\[\]\{\}\#]*(?:(?=\.)){1}|[^\s\n\?\[\]\{\}\.\#]*)?([\.]{1}[^\s\?\#]*)?)?(?:\?{1}([^\s\n\#\[\]]*))?([\#][^\s\n]*)?\)?

It captures 9 groups (plus group zero, which contains the entire URL). If a part does not exist in the URL, its group comes back empty (undefined in JavaScript).

  0. Entire URL – the URL being parsed
  1. Protocol – http, https, ftp
  2. Userinfo – username:password
  3. Domain – www.mydomain.com, mydomain.com, 127.0.0.1, localhost…
  4. Port – 80
  5. Path / Folders – /folder/dir/
  6. Page / Filename – e.g. index
  7. File extension – .html, .php…
  8. Query – item=value&item2=value2
  9. Anchor – #home
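The group list above can be turned into a small helper that returns the parts by name. This is only a sketch: parseUrl and GROUP_NAMES are hypothetical names introduced here, and the RegEx is the one from this article without the g flag, since we want a single match.

```javascript
// URL-RegEx 1.2 from this article, copied unchanged ("i" flag only).
const urlRegex = /\(?(?:(http|https|ftp):\/\/)?(?:((?:[^\W\s]|\.|-|[:]{1})+)@{1})?((?:www.)?(?:[^\W\s]|\.|-)+[\.][^\W\s]{2,4}|localhost(?=\/)|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})(?::(\d*))?([\/]?[^\s\?]*[\/]{1})*(?:\/?([^\s\n\?\[\]\{\}\#]*(?:(?=\.)){1}|[^\s\n\?\[\]\{\}\.\#]*)?([\.]{1}[^\s\?\#]*)?)?(?:\?{1}([^\s\n\#\[\]]*))?([\#][^\s\n]*)?\)?/i;

// Hypothetical names for the capture groups, in the order they appear in the RegEx.
const GROUP_NAMES = ['protocol', 'userinfo', 'host', 'port', 'path',
                     'filename', 'extension', 'query', 'fragment'];

// Returns an object with the named parts of the first URL found, or null.
function parseUrl(str) {
    const match = urlRegex.exec(str);
    if (!match) {
        return null;
    }
    const parts = { url: match[0] }; // group zero: the entire match
    GROUP_NAMES.forEach(function (name, i) {
        parts[name] = match[i + 1];  // undefined when that part is missing
    });
    return parts;
}

const parts = parseUrl('http://www.youtube.com/watch?v=hVW9eH_PUi8');
console.log(parts);
```

Naming the groups once in an array keeps the order in a single place, which helps when the same order is needed again for a replace callback.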


There is just one little catch. Because brackets () are allowed in URLs, let's imagine someone puts the URL inside brackets that are not part of the URL.

Our blog (http://someweblog.com) is awesome, isn't it?

So, in order not to mix up brackets that are not part of the URL, we capture the brackets before and after the URL as well. All you have to do after matching is check whether both brackets exist and remove them. Something like this (in JavaScript):

if ( str.charAt(0) == '(' && str.charAt( str.length-1 ) == ')' ) {
    str = str.slice(1,-1);
}
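Put together with the RegEx, the check could look like the sketch below. The helper names stripWrappingBrackets and firstUrl are hypothetical, and the second sample sentence is made up to show a URL whose brackets belong to it.

```javascript
// URL-RegEx 1.2 from this article, copied unchanged ("i" flag only).
const urlRegex = /\(?(?:(http|https|ftp):\/\/)?(?:((?:[^\W\s]|\.|-|[:]{1})+)@{1})?((?:www.)?(?:[^\W\s]|\.|-)+[\.][^\W\s]{2,4}|localhost(?=\/)|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})(?::(\d*))?([\/]?[^\s\?]*[\/]{1})*(?:\/?([^\s\n\?\[\]\{\}\#]*(?:(?=\.)){1}|[^\s\n\?\[\]\{\}\.\#]*)?([\.]{1}[^\s\?\#]*)?)?(?:\?{1}([^\s\n\#\[\]]*))?([\#][^\s\n]*)?\)?/i;

// Removes one pair of brackets, but only when the match is wrapped on both sides.
function stripWrappingBrackets(str) {
    if (str.charAt(0) === '(' && str.charAt(str.length - 1) === ')') {
        return str.slice(1, -1);
    }
    return str;
}

// Returns the first URL found in the text (without wrapping brackets), or null.
function firstUrl(text) {
    const match = urlRegex.exec(text);
    return match ? stripWrappingBrackets(match[0]) : null;
}

const wrapped = firstUrl("Our blog (http://someweblog.com) is awesome, isn't it?");
const unwrapped = firstUrl('See http://en.wikipedia.org/wiki/Regex_(disambiguation) for details');

console.log(wrapped);
console.log(unwrapped);
```

Only the complete match (group zero) is cleaned here. The second URL keeps its closing bracket because the match does not start with an opening one.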

How does it work?

If you really want to know how, or if something is unclear, just write below in the comments. I just couldn't bring myself to write this up right now, but I will if someone is interested.

JavaScript link shortener

Now here comes the fun part. Once we've matched a URL, we can do whatever we like with it. For example, you no longer have to worry about long links messing up your text. Here is a Twitter-like way of doing it:

// the parameters follow the order of the RegEx groups (userinfo is group 2)
var shortenUrl = function(url,protocol,userinfo,host,port,path,filename,ext,query,fragment) {
    // set url length limit
    var limit = 20,
        show_www = false;
    // remove brackets if URL inside them
    if ( url.charAt(0) == '(' && url.charAt( url.length-1 ) == ')' ) {
        url = url.slice(1,-1);
    }
    // add protocol if doesn't exist
    if ( !protocol ) {
        url = 'http://' + url;
    }
    // create new url to show (strip only a leading "www.")
    var domain = show_www ? host : host.replace(/^www\./i, '');
    var visibleUrl = domain + (path || '/') + (filename || '') + (ext || '') + (query ? '?'+query : '') + (fragment || '');
    // shorten URL if bigger than limit
    if ( visibleUrl.length > limit && domain.length < limit ) {
        visibleUrl = visibleUrl.slice(0, limit) + '...';
    }
    // wrap the visible text in a link to the full URL
    return '<a href="' + url + '">' + visibleUrl + '</a>';
};

// our URL RegEx
var urlRegex = /\(?(?:(http|https|ftp):\/\/)?(?:((?:[^\W\s]|\.|-|[:]{1})+)@{1})?((?:www.)?(?:[^\W\s]|\.|-)+[\.][^\W\s]{2,4}|localhost(?=\/)|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})(?::(\d*))?([\/]?[^\s\?]*[\/]{1})*(?:\/?([^\s\n\?\[\]\{\}\#]*(?:(?=\.)){1}|[^\s\n\?\[\]\{\}\.\#]*)?([\.]{1}[^\s\?\#]*)?)?(?:\?{1}([^\s\n\#\[\]]*))?([\#][^\s\n]*)?\)?/gi;

// some text with link
var text = 'Awesome tune, check it out! http://www.youtube.com/watch?v=hVW9eH_PUi8';

// magic
text = text.replace(urlRegex, shortenUrl);

The result: Awesome tune, check it out! youtube.com/watch?v=...

 

We hope that you find this technique useful. If you notice any bugs, or have suggestions about the RegEx, please feel free to write below in the comments.