Stefano Bolli: Parsing HTML string to get links in Javascript

Recently I needed a JavaScript function to retrieve links from an HTML string. Unfortunately I couldn't use powerful third-party tools like jQuery, so I decided to use a regular expression.
Let's assume we have an HTML page like this:

<html>
    <body>
        <a href="google.com" title="Google Site">Google</a>
        <a href="mozilla.com" title="Mozilla Site">Mozilla</a>
        <a href="blogger.com" title="Blogger Site">Blogger</a>
    </body>
</html>

This page contains links to Google, Mozilla and Blogger. How can we get the links from the HTML content?

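<script language="JavaScript" type="text/javascript">
function getLinks() {
    // Every line of the literal ends with a backslash so the string continues.
    // Each anchor stays on one line, so href and title are separated by a
    // single space, which is what the regex below expects.
    var html = "<html> \
                <body> \
                <a href=\"google.com\" title=\"Google Site\">Google</a> \
                <a href=\"mozilla.com\" title=\"Mozilla Site\">Mozilla</a> \
                <a href=\"blogger.com\" title=\"Blogger Site\">Blogger</a> \
                </body> \
                </html>";

    var links = [];

    // replace() is used only to run the callback on every match;
    // its return value is discarded.
    html.replace(
        /[^<]*(<a href="([^"]+)" title="([^"]+)">([^<]+)<\/a>)/g,
        function () {
            // arguments[1..4]: complete anchor, href, title, link text.
            links.push(Array.prototype.slice.call(arguments, 1, 5));
        });

    alert(links.join("\n"));
}
</script>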

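The getLinks() function retrieves the links from the HTML content and puts them into an array. For every match, the replace() callback receives the whole match followed by one argument per capture group. The slice method creates a new array from a selected section of those arguments: slice(1, 5) copies the four capture groups (the complete anchor, the href value, the title value and the link text).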
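To make the callback arguments easier to see, here is a minimal sketch that runs the same technique on a single anchor. The helper name collectGroups and its groupCount parameter are assumptions made for this illustration; they are not part of the original function.

```javascript
// Hypothetical helper (not from the original post): collects the capture
// groups of every match into an array of arrays.
function collectGroups(text, regex, groupCount) {
    var rows = [];
    text.replace(regex, function () {
        // arguments = [match, group1, ..., groupN, offset, wholeString]
        rows.push(Array.prototype.slice.call(arguments, 1, groupCount + 1));
    });
    return rows;
}

var sample = '<a href="google.com" title="Google Site">Google</a>';
var rows = collectGroups(
    sample,
    /(<a href="([^"]+)" title="([^"]+)">([^<]+)<\/a>)/g,
    4
);

// rows[0][0] is the complete anchor, rows[0][1] the href value,
// rows[0][2] the title value and rows[0][3] the link text.
console.log(rows[0]);
```

Slicing from index 1 skips the whole match, and stopping at groupCount + 1 leaves out the offset and the input string that replace() appends after the groups.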


In the end we have an array of arrays. To retrieve a single element, index it as links[x][y], where x is the row (the link) and y is the column (0 for the complete anchor, 1 for the href, 2 for the title, 3 for the link text).

For example, let's assume we want to extract some information from the first link (this code has to run inside getLinks(), where links is in scope):

alert("First link (Google):\n" +
      "Destination anchor: " + links[0][1] + "\n" +
      "\"title\" attribute: " + links[0][2] + "\n" +
      "Source anchor: " + links[0][3]);

The function has several limitations: for example, it is case sensitive and it depends on the exact attributes of the a element. In the case above both the href and title attributes are set, but if we have an a element like this:

<a href="google.com">Google</a>

without the title attribute, the regex won't match it. In that case we should modify the regex as follows (there are now only three capture groups, so slice stops at 4):

html.replace(
     /[^<]*(<a href="([^"]+)">([^<]+)<\/a>)/g, 
     function() {
        links.push(Array().slice.call(arguments, 1, 4));
    });
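Rather than keeping one regex per attribute layout, one possible variation is to capture the whole attribute list of each anchor and read href and title from it separately. The sketch below assumes double-quoted attribute values and link text without nested tags; the names readAttribute and extractLinks are hypothetical, and this is still not a real HTML parser.

```javascript
// Hypothetical helpers; a sketch under the assumptions stated above.
function readAttribute(attrs, name) {
    // Matches name="value" in any letter case; returns null when absent.
    var m = new RegExp('\\b' + name + '\\s*=\\s*"([^"]*)"', 'i').exec(attrs);
    return m ? m[1] : null;
}

function extractLinks(html) {
    var links = [];
    // Group 1: everything between "<a" and ">"; group 2: the link text.
    // The "i" flag makes the tag name case insensitive.
    var anchor = /<a\b([^>]*)>([^<]*)<\/a>/gi;
    var match;
    while ((match = anchor.exec(html)) !== null) {
        links.push({
            href: readAttribute(match[1], 'href'),
            title: readAttribute(match[1], 'title'),
            text: match[2]
        });
    }
    return links;
}

var found = extractLinks(
    '<A HREF="google.com">Google</A> ' +
    '<a title="Mozilla Site" href="mozilla.com">Mozilla</a>'
);
console.log(found);
```

Here a missing title simply comes back as null, and the attribute order no longer matters. Single-quoted or unquoted values, and anchors that contain other elements, would still be missed.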

Best regards.

Stefano Bolli

Wednesday, 5 January 2011
