
Regular Expression help - only preserve contents of Anchor tags



Morgenmuffel
29-04-2010, 01:59 PM
Hi all

I am currently trying to get info out of a former FrontPage site, which is a mess, to put it bluntly.

Basically, all I want out of the pages is the anchor links, like the one below:

<a href="/files/sopwith/camel.htm">Biggles and Algie</a>

While finding them should be easy, the sheer amount of extraneous tags is making the going painful,

and I can't get my regular expressions working.

Morgenmuffel
29-04-2010, 02:09 PM
This works to find the links, but what I want it to do is remove everything else, and I am blowed if I can figure it out. I also can't get the code below to work in Notepad++, but it works in an elderly version of Dreamweaver.



<a\b[^>]*>(.*?)</a>
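
For anyone reading along, here is the same pattern spelled out piece by piece. This is only a sketch in Python (not one of the tools mentioned in the post), and the `sample` string and the name `ANCHOR` are made up for illustration.

```python
import re

# The pattern from the post, broken up with re.VERBOSE so each part can carry a note.
ANCHOR = re.compile(r"""
    <a\b      # the opening of an anchor tag; the word boundary skips tags such as <abbr>
    [^>]*     # any attributes, up to the end of the opening tag
    >         # end of the opening tag
    (.*?)     # group 1: the link text, matched as short as possible
    </a>      # the closing tag
""", re.VERBOSE)

# Hypothetical scrap of FrontPage-style markup around the link from the post.
sample = '<p><font size="2"><a href="/files/sopwith/camel.htm">Biggles and Algie</a></font></p>'

print(ANCHOR.findall(sample))          # ['Biggles and Algie']
print(ANCHOR.search(sample).group(0))  # the whole <a ...>...</a> tag
```

`findall` returns only the captured link text, while `group(0)` gives the complete tag, which is the part worth keeping if the href is needed too.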

Morgenmuffel
29-04-2010, 02:40 PM
I take that back: the above code only finds some of the links, not all of them, as it doesn't find any that have line breaks in them,
e.g.
<a href="/files/sopwith/camel.htm">Biggles and Algie
</a>
Dammit, my brain is now officially hurting.

Morgenmuffel
29-04-2010, 03:17 PM
Eureka-ish


<a\b[^>]*>([\s\S]+?)</a>
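
A quick way to see why this version catches the links with line breaks: in most regex flavours `.` does not match a newline by default, while `[\s\S]` matches any character at all. The check below is a small Python sketch; the variable names are invented for the example, and the `re.DOTALL` flag is a Python-specific alternative, not something the editors in this thread necessarily offer.

```python
import re

# The link from the earlier post, with the line break before the closing tag.
html = '<a href="/files/sopwith/camel.htm">Biggles and Algie\n</a>'

dot_version = re.compile(r'<a\b[^>]*>(.*?)</a>')                # "." stops at the newline
class_version = re.compile(r'<a\b[^>]*>([\s\S]+?)</a>')         # [\s\S] matches anything
dotall_version = re.compile(r'<a\b[^>]*>(.*?)</a>', re.DOTALL)  # "." allowed to match newlines

print(dot_version.findall(html))     # []
print(class_version.findall(html))   # ['Biggles and Algie\n']
print(dotall_version.findall(html))  # ['Biggles and Algie\n']
```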


Probably not the most elegant, and I still can't work out how to get rid of all the other text on the page, or pipe the result into a new file on Windows.
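
One possible way to do both remaining steps is to skip search-and-replace and instead collect the matches and write only those out. The sketch below assumes Python is available on the Windows machine. The function names, the `cp1252` encoding guess and the case-insensitive flag (in case the old pages use `<A HREF=...>`) are all assumptions, not anything confirmed about the site.

```python
import re
from pathlib import Path

# Same idea as the pattern above, but without a capture group so the whole tag is kept.
ANCHOR = re.compile(r'<a\b[^>]*>[\s\S]+?</a>', re.IGNORECASE)


def extract_anchors(html):
    """Return every complete <a ...>...</a> tag in html, one string per link."""
    anchors = []
    for match in ANCHOR.finditer(html):
        # Collapse line breaks and runs of spaces so each link fits on one line.
        anchors.append(" ".join(match.group(0).split()))
    return anchors


def save_anchors(src, dst):
    """Read the page at src, write only its anchor tags to dst, return the count."""
    # cp1252 is a guess for an old Windows-authored page; adjust if the text looks wrong.
    html = Path(src).read_text(encoding="cp1252", errors="replace")
    anchors = extract_anchors(html)
    Path(dst).write_text("\n".join(anchors) + "\n", encoding="utf-8")
    return len(anchors)


page = '<p><b>junk</b> <a href="/files/sopwith/camel.htm">Biggles and Algie\n</a> more junk</p>'
print(extract_anchors(page))
```

Calling something like `save_anchors("camel.htm", "links.txt")` (both file names are placeholders) would then leave a text file with one link per line and nothing else. A regex like this can still trip over unusual markup, such as a `>` inside an attribute value, so it is worth spot-checking the output against a few pages.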