Detecting URLs in text is something you will probably need sooner or later when parsing text in your applications, and shortening them can be quite handy too (as Twitter does, for example).
Matching URLs is not that difficult, but matching them and grouping all of the URL fragments is, especially if you want to do it in a single match. So, here is one URL regular expression to rule them all.
RegEx
A regular expression (or RegEx) is a pattern language whose purpose is to match particular words, patterns or characters within a text string.
Someone once said that RegEx is like a language of its own. At first it may look like a bunch of random characters, but it is actually a pretty useful technique for both simple and complex matching of whatever you need within a text string. It is supported by all major programming languages and platforms (PHP, Perl, JavaScript, Java, .NET, etc.).
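As a small illustration of matching and grouping, here is a sketch in JavaScript. The pattern `simpleUrl` and the sample string are made up for this example and are far simpler than the full expression presented in this post; they only show how parentheses capture parts of a match.

```javascript
// Hypothetical, deliberately simple pattern (not the full URL RegEx):
// group 1 captures the protocol, group 2 captures the host.
var simpleUrl = /(https?):\/\/([^\/\s]+)/;

var sample = 'Read more at http://someweblog.com/archive today';
var found = simpleUrl.exec(sample);

// found[0] is the whole match, found[1] and found[2] are the captured groups
var wholeMatch = found[0]; // 'http://someweblog.com'
var protocol = found[1];   // 'http'
var host = found[2];       // 'someweblog.com'
```

Everything outside the parentheses still has to match, but only the parenthesised parts are handed back as separate groups.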
URL
We all know what a URL looks like. If you need a refresher on URL anatomy (along with good SEO practices), a quick search across the web will turn one up.
URL Regular Expression
In order to match and group all the fragments of a URL, and to handle the many different URL variations, our RegEx is a little bit longer (324 chars), but it captures and groups all parts nicely (tested in PHP and JavaScript).
/* URL-RegEx 1.2 */
\(?(?:(http|https|ftp):\/\/)?(?:((?:[^\W\s]|\.|-|[:]{1})+)@{1})?((?:www.)?(?:[^\W\s]|\.|-)+[\.][^\W\s]{2,4}|localhost(?=\/)|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})(?::(\d*))?([\/]?[^\s\?]*[\/]{1})*(?:\/?([^\s\n\?\[\]\{\}\#]*(?:(?=\.)){1}|[^\s\n\?\[\]\{\}\.\#]*)?([\.]{1}[^\s\?\#]*)?)?(?:\?{1}([^\s\n\#\[\]]*))?([\#][^\s\n]*)?\)?
It captures 9 groups (plus group zero, which contains the entire URL). If some of them don’t exist in the URL, that group will be returned empty.
- Entire URL – url being parsed
- Protocol – http, https, ftp
- Userinfo – username:password
- Domain – www.mydomain.com, mydomain.com, 127.0.0.1, localhost…
- Port – 80
- Path / Folders – /folder/dir/
- Page / Filename – eg. index
- File extension – .html, .php…
- Query – item=value&item2=value2
- Anchor – #home
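To make the mapping between group numbers and URL parts concrete, here is a minimal sketch in JavaScript. The regular expression is copied unchanged from above; the helper `parseUrl`, its field names and the sample URL are assumptions made for this illustration only.

```javascript
// URL-RegEx 1.2 from this post, copied unchanged. There is no "g" flag here,
// so exec() always starts matching from the beginning of the string.
var urlRegex = /\(?(?:(http|https|ftp):\/\/)?(?:((?:[^\W\s]|\.|-|[:]{1})+)@{1})?((?:www.)?(?:[^\W\s]|\.|-)+[\.][^\W\s]{2,4}|localhost(?=\/)|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})(?::(\d*))?([\/]?[^\s\?]*[\/]{1})*(?:\/?([^\s\n\?\[\]\{\}\#]*(?:(?=\.)){1}|[^\s\n\?\[\]\{\}\.\#]*)?([\.]{1}[^\s\?\#]*)?)?(?:\?{1}([^\s\n\#\[\]]*))?([\#][^\s\n]*)?\)?/i;

// Hypothetical helper: returns one named field per item in the list above,
// or null when nothing in the string looks like a URL.
function parseUrl(str) {
  var m = urlRegex.exec(str);
  if (!m) {
    return null;
  }
  return {
    url: m[0],
    protocol: m[1],
    userinfo: m[2],
    domain: m[3],
    port: m[4],
    path: m[5],
    filename: m[6],
    extension: m[7],
    query: m[8],
    anchor: m[9]
  };
}

// Made-up sample URL that exercises every group
var parts = parseUrl('ftp://user:pass@example.com:8080/folder/dir/index.html?item=value#home');
// parts.userinfo is 'user:pass', parts.path is '/folder/dir/',
// parts.filename is 'index' and parts.extension is '.html'
```

In JavaScript, a group that did not take part in the match comes back as `undefined`, so check a field before calling string methods on it.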
You can see it in action, and also test your own URLs:
Test RegEx
There is just one little catch. Brackets () are allowed in URLs, so imagine that someone puts the URL inside brackets that are not part of it:
Our blog (http://someweblog.com) is awesome, isn't it?
So, to avoid mixing up a bracket that is not part of the URL, we capture the brackets before and after the URL as well. All you have to do after matching is check whether both brackets exist and remove them. Something like this (in JavaScript):
if ( str.charAt(0) == '(' && str.charAt( str.length-1 ) == ')' ) {
    str = str.slice(1,-1);
}
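Put together with the matching step, the bracket check can be sketched as follows. The regular expression is the one from this post with the g flag added; the helper name `stripWrappingBrackets` and the variable names are assumptions for this illustration.

```javascript
// URL-RegEx 1.2 from this post, with the "g" flag so match() returns every URL
var urlRegex = /\(?(?:(http|https|ftp):\/\/)?(?:((?:[^\W\s]|\.|-|[:]{1})+)@{1})?((?:www.)?(?:[^\W\s]|\.|-)+[\.][^\W\s]{2,4}|localhost(?=\/)|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})(?::(\d*))?([\/]?[^\s\?]*[\/]{1})*(?:\/?([^\s\n\?\[\]\{\}\#]*(?:(?=\.)){1}|[^\s\n\?\[\]\{\}\.\#]*)?([\.]{1}[^\s\?\#]*)?)?(?:\?{1}([^\s\n\#\[\]]*))?([\#][^\s\n]*)?\)?/gi;

// Hypothetical helper: removes one pair of wrapping brackets, if both exist
function stripWrappingBrackets(str) {
  if (str.charAt(0) === '(' && str.charAt(str.length - 1) === ')') {
    return str.slice(1, -1);
  }
  return str;
}

var text = "Our blog (http://someweblog.com) is awesome, isn't it?";
var matches = text.match(urlRegex) || [];      // match() returns null when nothing is found
var urls = matches.map(stripWrappingBrackets);
// matches[0] still includes the surrounding brackets; urls[0] does not
```

Because the check requires both an opening and a closing bracket, a URL that merely ends in `)` is left untouched.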
How does it work?
If you really want to know how it works, or something is unclear, just write below in the comments. I just couldn’t get myself to write the explanation right now, but I will if someone is interested.
JavaScript link shortener
Now here comes the fun part: once we’ve matched a URL, we can do whatever we like with it. For example, you no longer have to worry about long links messing up your text. Here is a Twitter-like way of doing it:
// parameters follow the order of the capture groups in the RegEx
var shortenUrl = function(url,protocol,userinfo,host,port,path,filename,ext,query,fragment) {
    // set url length limit
    var limit = 20,
        show_www = false;
    // remove brackets if URL inside them
    if ( url.charAt(0) == '(' && url.charAt( url.length-1 ) == ')' ) {
        url = url.slice(1,-1);
    }
    // add protocol if doesn't exist
    if ( !protocol ) {
        url = 'http://' + url;
    }
    // create new url to show
    var domain = show_www ? host : host.replace(/www\./gi, '');
    var visibleUrl = domain + (path || '/') + (filename || '') + (ext || '') + (query ? '?'+query : '') + (fragment || '');
    // shorten URL if bigger than limit
    if ( visibleUrl.length > limit && domain.length < limit ) {
        visibleUrl = visibleUrl.slice(0, domain.length + (limit - domain.length)) + '...';
    }
    // wrap the shortened text in a link that points to the full URL
    return '<a href="' + url + '">' + visibleUrl + '</a>';
};
// our URL RegEx
var urlRegex = /\(?(?:(http|https|ftp):\/\/)?(?:((?:[^\W\s]|\.|-|[:]{1})+)@{1})?((?:www.)?(?:[^\W\s]|\.|-)+[\.][^\W\s]{2,4}|localhost(?=\/)|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})(?::(\d*))?([\/]?[^\s\?]*[\/]{1})*(?:\/?([^\s\n\?\[\]\{\}\#]*(?:(?=\.)){1}|[^\s\n\?\[\]\{\}\.\#]*)?([\.]{1}[^\s\?\#]*)?)?(?:\?{1}([^\s\n\#\[\]]*))?([\#][^\s\n]*)?\)?/gi;
// some text with link
var text = 'Awesome tune, check it out! http://www.youtube.com/watch?v=hVW9eH_PUi8';
// magic
text = text.replace(urlRegex, shortenUrl);
The result: Awesome tune, check it out! youtube.com/watch?v=...
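The truncation rule inside shortenUrl can be looked at in isolation. The sketch below is a hypothetical helper, not part of the code above: `truncateVisibleUrl` and its parameters are names chosen for this illustration, and the limit of 20 simply mirrors the value used in shortenUrl.

```javascript
// Hypothetical helper that isolates the truncation rule.
// "domain" is the host without "www.", "rest" is everything shown after it
// (path, filename, extension, query, anchor), "limit" is the maximum length.
function truncateVisibleUrl(domain, rest, limit) {
  var visible = domain + rest;
  // only cut when the text is too long AND the domain itself fits in the limit,
  // so a domain name is never chopped in half
  if (visible.length > limit && domain.length < limit) {
    return visible.slice(0, limit) + '...';
  }
  return visible;
}

var shortened = truncateVisibleUrl('youtube.com', '/watch?v=hVW9eH_PUi8', 20);
// 'youtube.com/watch?v=...'
var untouched = truncateVisibleUrl('someweblog.com', '/', 20);
// 'someweblog.com/' (15 characters, under the limit)
var longDomain = truncateVisibleUrl('a-very-long-domain-name.com', '/page', 20);
// returned whole, because the domain alone is already longer than the limit
```

Note that `domain.length + (limit - domain.length)` in shortenUrl is just `limit`, which is why the sketch slices at `limit` directly.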
We hope that you find this technique useful. If you notice any bugs, or have suggestions about the RegEx, please feel free to write below in the comments.
jb ruler
Recognizes 601.984.23 as a URL
jb ruler
Splits long first level domains, like .media
Mark
That has to be one of the most impressive bits of Regex I’ve ever seen. Then again, I’m not all that experienced with Regex so maybe this is actually pretty basic ;). Either way, thanks for sharing!
Donna L. Gifford
This is working completely fine for me. Thanks for sharing.
SONU DHAKAR
this is awesome, and working for me,
thanks
Regards –
sonu dhakar
Jaya
Good One! Just failing this case:
var string = '<a href="http://www.google.com">test</a>';
I want to get only the URL from this string, but it gives me the wrong match: http://www.google.com">test
Jaya
The string has an <a> tag in it, which got converted into a hyperlink here.
R@hul
Yes Bawa, I found You
Adrian
Hi,
It seems that URLs like http://www.test are passing.
Dennis
I’m running into the same issue: as soon as www. is part of the url, it will pass without having a .com or anything.
Any thoughts on that?
David
a couple of cases I found that don’t seem to work as expected:
thrivehive.com does not match anything (trailing slash is required)
http://thrivehive.com does not match anything (trailing slash is required)
http://thrivehive.co.uk returns .uk as the file extension (unless you have a trailing slash)
Some Web Guy
Works fine for me, try it in “test regex” link, you’ll see they work.
David
It works in the “test regex” link, but I think that’s because they’re surrounded by empty space. When I pass a url to match as a string it says that “.uk” is the file extension.
For example, in JavaScript, this works: " http://example.com ".match(pattern), but this doesn’t: "http://example.com".match(pattern)
Some Web Guy
Yep, you were right. Actually I’ve never tried to test just the url. I’ve fixed it now, here is the test: http://jsfiddle.net/2cJMu/
David
Awesome, thank you so much. This is an awesome regex, really!
Have you seen John Gruber’s “An Improved Liberal, Accurate Regex Pattern for Matching URLs”? It’s pretty good. I did a write up comparing yours to his, here: http://bit.ly/15xStKW One thing missing from your regex is support for special characters, like ✪. Would you consider including this?
David
Also, I don’t know if this is a use case you care about, but links that provide a username and password (e.g. for ftp servers) fail to match. For example, “ftp://user:pass@perlide.org/pub/Makefile.PL”
Some Web Guy
Nice article :)
Yeah, I’ll definitely update it to support special characters, I’ve missed that.
Also, I’ll improve ftp match, thanks for pointing it out.
And those brackets (), have to check why it didn’t pick them up.
So expect a new version in a day or two ;)
David
Also, http://www.example.com?foo=bar
and http://www.example.com?foo=bar don’t seem to match (url parameters on a domain without a slash following it)
Some Web Guy
Try now ;)
Alexander Griffioen
Hm, found a bug!
When shortening this: http://www.site.com/section
I get this: site.comsection
The path var is undefined when the trailing slash is missing.
Some Web Guy
Thanx, corrected now ;)
Alexander Griffioen
Ha! Many, many thanks! Was hoping to find a reg ex for URL detection, but you gave me *exactly* what I needed it for :)
Chris
nice, could you perhaps modify the code so it permits the bracket
like:
http://www.domain.com/clk;12345;v;pc=%5BID%5D
Chris
^ It removed the bracket. Anyway, I mean so that it accepts the bracket: “[ID]”
Some Web Guy
Square brackets are unsupported in URL and shouldn’t be used.
http://en.wikipedia.org/wiki/Help:URL
Scott Feinstein
Very nice! I like that you provide a variety of test cases. I notice the regex doesn’t match:
Without protocol, without slash but with query string params
www test. test.com?foo=bar
Some Web Guy
Yeah, haven’t tried that case. I will fix it as soon as possible.
EDIT: Fixed ;)