Giving PCRE a Try

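I have been using regular expressions in C++ for some years now, but always through the regex library in boost, and frankly I never found it very pleasant to use. I had known about PCRE for a long time, yet I always assumed it was a library written for PHP. I really have been living under a rock. 囧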

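When I brought this up on Twitter a couple of days ago, 火炬 told me that PCRE is easier to use than the boost regex library, so I gave it a try. To my embarrassment, it turns out that BCB (C++Builder) already ships with PCRE; it just isn't mentioned in the documentation.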

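PCRE is a C library, though, so it is not very convenient to use directly. There are C++ wrappers such as PCRE++, but it only ships with a GNU build configuration, and porting it to BCB would probably be a hassle. Since I only need a handful of features, I wrote a simple wrapper of my own on top of VCL's AnsiString and TStringList. It makes things a lot more convenient.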

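#include <vcl.h>   // AnsiString, TStrings, TStringList, Exception
#include <pcre.h>

class TPCRE
{
private:
    AnsiString FPattern;
    pcre * FRE;
    TStrings * FMatches;

    // declared but not implemented: copying would release FRE twice
    TPCRE(const TPCRE &);
    TPCRE & operator=(const TPCRE &);

public:
    __fastcall TPCRE(AnsiString aPattern="");
    __fastcall ~TPCRE();

    void __fastcall compile(AnsiString aPattern="");
    int __fastcall exec(AnsiString aStr); // returns the number of matched substrings
    AnsiString __fastcall repeat_replace(AnsiString aStr, AnsiString aRepStr="");

    __property TStrings * Matches = { read = FMatches };
};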

__fastcall TPCRE::TPCRE(AnsiString aPattern)
    : FRE(NULL), FMatches(new TStringList())
{
    FPattern = aPattern;
    if ( FPattern != "" )
        compile();
}

__fastcall TPCRE::~TPCRE()
{
    // memory returned by pcre_compile() is released through pcre_free
    if (FRE)
        pcre_free(FRE);
    delete FMatches;
}

void __fastcall TPCRE::compile(AnsiString aPattern)
{
    if ( aPattern != "" )
        FPattern = aPattern;
    const char * error;
    int erroffset;
    // release the previously compiled pattern, if any
    if (FRE) {
        pcre_free(FRE);
        FRE = NULL;
    }
    FRE = pcre_compile(FPattern.c_str(), 0, &error, &erroffset, NULL);
    if ( FRE == NULL )
        throw Exception(AnsiString("PCRE compilation failed at offset ")
            + IntToStr(erroffset) + ": " + error);
}

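int __fastcall TPCRE::exec(AnsiString aStr)
{
    if (!FRE)
        throw Exception("No pattern has been compiled!");
    const int OVECCOUNT = 30; // must be a multiple of 3
    int ovector[OVECCOUNT];
    FMatches->Clear();
    // This 7-argument call matches the pcre.h used here (the one bundled with BCB);
    // current PCRE releases take an extra startoffset argument before the options.
    int rc = pcre_exec(FRE, NULL, aStr.c_str(), aStr.Length(), 0, ovector, OVECCOUNT);
    if (rc == PCRE_ERROR_NOMATCH)
        return 0; // no match: Matches stays empty
    if (rc < 0)
        throw Exception(AnsiString("Matching error ") + IntToStr(rc));
    if (rc == 0)
        rc = OVECCOUNT / 3; // ovector was too small; only the first groups were stored
    // ovector holds 0-based [start, end) pairs, while SubString() is 1-based
    for (int i = 0; i < rc; i++)
        FMatches->Add(aStr.SubString(
            ovector[2*i]+1, ovector[2*i+1]-ovector[2*i]));
    return rc;
}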

AnsiString __fastcall TPCRE::repeat_replace(AnsiString aStr, AnsiString aRepStr)
{
    if (!FRE)
        throw Exception("No pattern has been compiled!");
    const int OVECCOUNT = 30;
    int ovector[OVECCOUNT];
    const int total = aStr.Length();
    int n = 1; // 1-based index in aStr where the next search starts
    AnsiString s = "";
    while (n <= total) {
        // search the remaining tail; the offsets in ovector are relative to that tail
        int rc = pcre_exec(FRE, NULL, aStr.c_str()+n-1, total-n+1, 0, ovector, OVECCOUNT);
        if (rc == PCRE_ERROR_NOMATCH)
            break;
        if (rc < 0)
            throw Exception(AnsiString("Matching error ") + IntToStr(rc));
        // copy the text before the match, then the replacement
        s += aStr.SubString(n, ovector[0]) + aRepStr;
        if (ovector[1] > ovector[0])
            n += ovector[1];
        else {
            // empty match: copy one character and step over it to avoid looping forever
            s += aStr.SubString(n+ovector[0], 1);
            n += ovector[0] + 1;
        }
    }
    // append whatever follows the last match
    if (n <= total)
        s += aStr.SubString(n, total-n+1);
    return s;
}
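
The `+1` in `exec` is easy to trip over: PCRE reports each captured group as a pair of 0-based offsets (start, one past the end) in `ovector`, while VCL's `SubString` counts from 1. The sketch below shows the same extraction with `std::string`, which is 0-based, so no adjustment is needed. It is only an illustration: `extract_groups` is a hypothetical helper, and the offsets are passed in by hand so that it runs without PCRE.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: turn a pcre_exec-style offset vector into substrings.
// ovector[2*i] is the 0-based start of group i, ovector[2*i+1] is one past its end.
std::vector<std::string> extract_groups(const std::string& subject,
                                        const std::vector<int>& ovector,
                                        int count)
{
    std::vector<std::string> groups;
    for (int i = 0; i < count; ++i) {
        const int start = ovector[2 * i];
        const int end = ovector[2 * i + 1];
        if (start < 0)
            groups.push_back("");  // PCRE stores -1 for a group that did not participate
        else
            groups.push_back(subject.substr(start, end - start));
    }
    return groups;
}
```

For the subject `<title>Hello</title>` and the pattern `<title>([^<]*)`, the offsets would be 0, 12 for the whole match and 7, 12 for the first group, so the helper yields `<title>Hello` and `Hello`.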

Usage is simple. Here is a sample that converts HTML to plain text:

// requires #include <memory> for std::auto_ptr
// input:  const char *s
// output: AnsiString sResult
std::auto_ptr<TPCRE> re(new TPCRE("(?ims)<title>([^<]*)"));
AnsiString sResult;
if (re->exec(s)>1)
    sResult = re->Matches->Strings[1].Trim();
re->compile("(?ims)<body[^>]*>(.*)");
if (re->exec(s)>1) {
    AnsiString str = re->Matches->Strings[1].Trim();
    str = StringReplace(str,"\r","",TReplaceFlags()<<rfReplaceAll);
    str = StringReplace(str,"\n","",TReplaceFlags()<<rfReplaceAll);
    str = StringReplace(str,"&nbsp;"," ",TReplaceFlags()<<rfReplaceAll);
    // replace <br /> with \r\n (the backslash must be doubled in a C++ string literal)
    re->compile("(?i)<br\\s*/?>");
    str = re->repeat_replace(str,"\r\n");
    // remove <script...>...</script> (non-greedy, so each script block is removed on its own)
    re->compile("(?ims)<script[^>]*>.*?</script>");
    str = re->repeat_replace(str);
    // remove <!-- ... --> (non-greedy, so text between two comments is kept)
    re->compile("(?ims)<!--.*?-->");
    str = re->repeat_replace(str);
    // remove <...>
    re->compile("(?ims)<[^>]*>");
    str = re->repeat_replace(str);
    sResult += str;
}

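In short, it extracts the title and the body, removes all line breaks from the body, replaces &nbsp; with spaces, turns br tags into line breaks, drops scripts and comments, and finally strips all remaining HTML tags.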
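
For readers without BCB or PCRE at hand, the same steps can be sketched in standard C++ with `std::regex`. This is a different regex engine (ECMAScript syntax), not the TPCRE wrapper above: inline flags such as `(?ims)` are not available, so case-insensitivity is requested through `std::regex::icase` and `[\s\S]` stands in for a dot that matches newlines. The names `html_to_text` and `trim` are hypothetical helpers introduced for this illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: strip leading and trailing whitespace.
static std::string trim(const std::string& s)
{
    const char* ws = " \t\r\n";
    const std::string::size_type b = s.find_first_not_of(ws);
    if (b == std::string::npos)
        return "";
    const std::string::size_type e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Hypothetical helper: the title/body pipeline from the post, using std::regex.
std::string html_to_text(const std::string& html)
{
    using std::regex;
    const regex::flag_type icase = regex::ECMAScript | regex::icase;
    std::string result;
    std::smatch m;

    // title text
    if (std::regex_search(html, m, regex("<title>([^<]*)", icase)))
        result = trim(m[1].str());

    // everything after the opening <body> tag
    if (std::regex_search(html, m, regex("<body[^>]*>([\\s\\S]*)", icase))) {
        std::string body = trim(m[1].str());
        body = std::regex_replace(body, regex("[\\r\\n]"), "");          // drop line breaks
        body = std::regex_replace(body, regex("&nbsp;"), " ");           // &nbsp; -> space
        body = std::regex_replace(body, regex("<br\\s*/?>", icase), "\r\n");
        body = std::regex_replace(body,
            regex("<script[^>]*>[\\s\\S]*?</script>", icase), "");       // scripts
        body = std::regex_replace(body, regex("<!--[\\s\\S]*?-->"), ""); // comments
        body = std::regex_replace(body, regex("<[^>]*>"), "");           // remaining tags
        result += body;
    }
    return result;
}
```

As in the original sample, the title and the body text are simply concatenated, and a regex-based approach like this is only suitable for simple, reasonably well-formed pages.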