Let's assume we have an HTML page like this:
<html>
<body>
<a href="google.com" title="Google Site">Google</a>
<a href="mozilla.com" title="Mozilla Site">Mozilla</a>
<a href="blogger.com" title="Blogger Site">Blogger</a>
</body>
</html>
This page contains links to Google, Mozilla and Blogger. How can we get the links from the HTML content?
<script language="JavaScript" type="text/javascript">
function getLinks() {
  // The sample page, kept in a string for this example.
  var html = "<html> " +
    "<body> " +
    "<a href=\"google.com\" title=\"Google Site\">Google</a> " +
    "<a href=\"mozilla.com\" title=\"Mozilla Site\">Mozilla</a> " +
    "<a href=\"blogger.com\" title=\"Blogger Site\">Blogger</a> " +
    "</body> " +
    "</html>";
  var links = [];
  // replace() is used only to visit every match; its return value is ignored.
  html.replace(
    /[^<]*(<a href="([^"]+)" title="([^"]+)">([^<]+)<\/a>)/g,
    function() {
      // Keep the four captured groups: whole A element, href, title, link text.
      links.push(Array().slice.call(arguments, 1, 5));
    });
  alert(links.join("\n"));
}
</script>
The getLinks() function retrieves the links from the HTML content and puts them into an array. For each match, the slice method creates a new array from a selected section of the callback's arguments object: elements 1 to 4, which hold the four captured groups. More information about the slice method can be found in its reference documentation.
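To make the role of slice more concrete, here is a minimal sketch of what the replace callback receives for a single link. The names sample and captured are made up for this illustration and are not part of getLinks().

```javascript
// Sketch only: "sample" and "captured" are illustrative names.
var sample = '<a href="google.com" title="Google Site">Google</a>';
var captured = [];

sample.replace(
  /(<a href="([^"]+)" title="([^"]+)">([^<]+)<\/a>)/g,
  function () {
    // arguments[0] is the full match, arguments[1] to arguments[4] are the
    // four captured groups, followed by the match offset and the input string.
    // slice(1, 5) therefore keeps only the four captured groups.
    captured.push(Array.prototype.slice.call(arguments, 1, 5));
  }
);

// captured[0][0] is the whole A element
// captured[0][1] is the href value
// captured[0][2] is the title value
// captured[0][3] is the link text
```

Array.prototype.slice.call is used here instead of Array().slice.call; both borrow the same method, the former just avoids creating a throwaway array.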
So at the end we have an "array of arrays". If we want to retrieve a single element, we can access it as links[x][y], where x is the row (the link) and y is the column (0: the whole A element, 1: the href value, 2: the title value, 3: the link text).
For example, let's assume we want to extract some information from the first link:
alert("First link (Google):\n" +
  "Destination anchor: " + links[0][1] + "\n" +
  "\"title\" attribute: " + links[0][2] + "\n" +
  "Source anchor: " + links[0][3]);
The function has several limits: for example, it is case sensitive and depends on the exact attributes of the A element. In the case above, both the href and title attributes are set, but if we have an A element like this:
<a href="google.com">Google</a>
without the title attribute, the function won't work. In that case, we should modify the regex in this way:
html.replace(
  /[^<]*(<a href="([^"]+)">([^<]+)<\/a>)/g,
  function() {
    // Only three captured groups now: whole A element, href, link text.
    links.push(Array().slice.call(arguments, 1, 4));
  });
Best regards.
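As a follow-up to the limits mentioned above, here is one possible way to relax two of them at once: the i flag makes the match case insensitive, and an optional group lets the title attribute be absent. This is only a sketch under the same assumptions as before (double-quoted attributes, href first, no other attributes); extractLinks is a hypothetical name, not part of the original function.

```javascript
// Sketch only: extractLinks is a hypothetical helper name.
function extractLinks(html) {
  var links = [];
  // i flag: also matches <A HREF="...">.
  // (?: title="([^"]*)")? makes the title attribute optional.
  var pattern = /<a href="([^"]+)"(?: title="([^"]*)")?>([^<]+)<\/a>/gi;
  html.replace(pattern, function (match, href, title, text) {
    // When the title group does not take part in the match, store "".
    // Columns keep the same layout: whole A element, href, title, link text.
    links.push([match, href, title || "", text]);
  });
  return links;
}

var found = extractLinks(
  '<A HREF="google.com">Google</A> ' +
  '<a href="mozilla.com" title="Mozilla Site">Mozilla</a>'
);
// found[0][2] is "" (no title), found[1][2] is "Mozilla Site"
```

It still breaks on single-quoted attributes, a different attribute order, or nested markup inside the link, so for arbitrary pages a real HTML parser is probably the safer choice.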