Tuesday, January 4, 2011

Parsing an HTML string to get links in JavaScript

Recently I needed a JavaScript function to retrieve links from an HTML string. Unfortunately I couldn't use powerful third-party tools like jQuery, so I decided to use a regular expression.
Let's assume we have an HTML page like this:
        <a href="" title="Google Site">Google</a>
        <a href="" title="Mozilla Site">Mozilla</a>
        <a href="" title="Blogger Site">Blogger</a>
This page contains links to Google, Mozilla and Blogger. How can we get the links from the HTML content?
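<script language="JavaScript" type="text/javascript">
function getLinks() {
    // The href values are empty here; ([^"]+) below only matches a non-empty URL.
    var html = "<html> \
                <body> \
                <a href=\"\" title=\"Google Site\">Google</a> \
                <a href=\"\" title=\"Mozilla Site\">Mozilla</a> \
                <a href=\"\" title=\"Blogger Site\">Blogger</a> \
                </body> \
                </html>";

    var links = [];

    // replace() is used only to visit each match; its return value is discarded.
    html.replace(
     /[^<]*(<a href="([^"]+)" title="([^"]+)">([^<]+)<\/a>)/g, 
     function() {
        // arguments[1] to arguments[4]: whole anchor, href, title, link text
        links.push(Array.prototype.slice.call(arguments, 1, 5));
     });

    return links;
}

var links = getLinks();
</script>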

The getLinks() function retrieves the links from the HTML content and puts them into an array. For each match, the slice method creates a new array from a selected section of the callback's arguments, namely the captured groups. Some useful information about the slice method can be found in its reference documentation.
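As a self-contained illustration of the approach described above, here is a minimal sketch that takes the HTML as a parameter. The function name getLinksFrom and the example.com URLs are assumptions made for the example, not the links of the original page. It relies on replace() calling its callback with the full match followed by the captured groups, so slicing from index 1 to 5 keeps the four groups.

```javascript
// Sketch: collect [anchor, href, title, text] for every matching link.
// getLinksFrom and the URLs below are hypothetical placeholders.
function getLinksFrom(html) {
    var links = [];
    var pattern = /[^<]*(<a href="([^"]+)" title="([^"]+)">([^<]+)<\/a>)/g;

    // replace() is used only to visit each match; its return value is ignored.
    html.replace(pattern, function () {
        // arguments: [0] full match, [1] anchor, [2] href, [3] title, [4] text,
        // followed by the match offset and the input string.
        links.push(Array.prototype.slice.call(arguments, 1, 5));
    });

    return links;
}

var sampleHtml =
    '<html><body>' +
    '<a href="https://example.com/google" title="Google Site">Google</a> ' +
    '<a href="https://example.com/mozilla" title="Mozilla Site">Mozilla</a>' +
    '</body></html>';

var sampleLinks = getLinksFrom(sampleHtml);
console.log(sampleLinks.length);   // 2
console.log(sampleLinks[0][1]);    // https://example.com/google
console.log(sampleLinks[1][3]);    // Mozilla
```

Each row keeps the whole anchor at index 0, so the href, title and text sit at indexes 1, 2 and 3.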

So at the end we have an "array of arrays". To retrieve a single element, we can access it as links[x][y], where x is the row (the link) and y is the column (the captured field).
For example, let's assume we want to extract some information from the first link:
alert("First link (Google):\n" +
      "Destination anchor: " + links[0][1] + "\n" +
      "\"title\" attribute: " + links[0][2] + "\n" +
      "Source anchor: " + links[0][3]);
The function has several limitations: for example, it is case sensitive and depends on the exact attributes of the A element. In the case above, both the href and title attributes are set, but if we have an A element like this:
<a href="">Google</a>
without the title attribute, the function won't work. In that case, we should modify the regex in this way:
    html.replace(
     /[^<]*(<a href="([^"]+)">([^<]+)<\/a>)/g, 
     function() {
        links.push(Array.prototype.slice.call(arguments, 1, 4));
     });
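Both limitations could be relaxed with a single pattern. The following is only a sketch, assuming that attributes use double quotes and that title, when present, directly follows href. The name getLinksTolerant and the example.com URLs are hypothetical. The i flag ignores the case of tag and attribute names, and the optional non-capturing group makes the title attribute optional.

```javascript
// Sketch: same row layout as before, [anchor, href, title, text],
// but title is undefined when the attribute is missing.
function getLinksTolerant(html) {
    var links = [];
    // (?: title="([^"]*)")? makes the title optional; the i flag ignores case.
    var pattern = /<a href="([^"]+)"(?: title="([^"]*)")?>([^<]+)<\/a>/gi;
    var match;

    // With the g flag, exec() resumes after the previous match on each call.
    while ((match = pattern.exec(html)) !== null) {
        links.push(match.slice(0, 4));
    }
    return links;
}

var mixedHtml =
    '<A HREF="https://example.com/google">Google</A> ' +
    '<a href="https://example.com/mozilla" title="Mozilla Site">Mozilla</a>';

var found = getLinksTolerant(mixedHtml);
console.log(found[0][1]);  // https://example.com/google
console.log(found[1][2]);  // Mozilla Site
```

A regular expression still cannot cope with every valid HTML variation (single quotes, extra attributes, nested tags), so this remains a workaround for cases where a real parser is not available.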
Best regards.