HTML parser in C++

When I was in college, I looked for an HTML parser in C++ but could not find one. Five years on, the task has become simple.

First, here is a screenshot of the program:

[Screenshot: approval list]

This program uses the following tools:

  • WebGrab
  • tidy (libtidy, called from C++)
  • HTML parser

Today all of these work well and are simple to use.
Now let's look at the program structure:

  • WebGrab grabs the page source;
  • tidy cleans up the HTML;
  • the HTML parser extracts the target HTML snippet.
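
The three steps above form a simple pipeline in which each stage consumes the text produced by the previous one. Here is a minimal sketch of that shape in portable C++. The names `Stage`, `RunPipeline`, `FakeGrab`, `FakeTidy` and `FakeExtract` are hypothetical stand-ins invented for illustration; they are not part of WebGrab, tidy or the HTML parser.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// One pipeline stage: takes text in, returns transformed text.
using Stage = std::function<std::string(const std::string&)>;

// Runs the stages in order, feeding each stage's output to the next one.
std::string RunPipeline(const std::vector<Stage>& stages, std::string input) {
    for (const Stage& stage : stages) {
        assert(static_cast<bool>(stage));  // every stage must be callable
        input = stage(input);
    }
    return input;
}

// Hypothetical stand-in for WebGrab: pretend the server returned
// sloppy markup (the <p> is never closed).
std::string FakeGrab(const std::string& url) {
    return "<div id=demo1><p>page for " + url + "</div>";
}

// Hypothetical stand-in for tidy: a real cleanup would repair the markup.
std::string FakeTidy(const std::string& html) {
    return "<!-- tidied -->" + html;
}

// Hypothetical stand-in for the HTML parser: keep everything from the
// first "<div" onward, or return an empty string if there is none.
std::string FakeExtract(const std::string& html) {
    std::string::size_type pos = html.find("<div");
    return pos == std::string::npos ? std::string() : html.substr(pos);
}
```

Calling `RunPipeline({FakeGrab, FakeTidy, FakeExtract}, url)` chains the three stand-ins; in the real program each stage is replaced by the corresponding tool.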

Now let's build the program.

Include the WebGrab header:

#include "WebGrab.h"

Then fetch the page with WebGrab:

	CWebGrab grab;
	// Set all parameters
	grab.SetTimeOut(2000);
	// Call init (the first argument here is a browser user-agent string)
	grab.Initialise(_T("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36"), NULL);

	CString szBuff;

	// Download the page; give up if the request fails
	if (!grab.GetFile(szUrl, szBuff, _T("Opera"), NULL)) {
		return;
	}
#ifdef _UNICODE
	// Unicode build: convert the downloaded UTF-8 bytes to UTF-16
	CString szPageW = UTF8Util::ConvertUTF8ToUTF16((char*)szBuff.GetBuffer());
	m_szHtmlPage = szPageW.GetBuffer();
#else
	m_szHtmlPage = szBuff;
#endif
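
For readers who wonder what a UTF-8 to UTF-16 conversion such as the one in the Unicode branch involves, here is a minimal portable sketch. It is not the implementation of `UTF8Util::ConvertUTF8ToUTF16`; the function name `Utf8ToUtf16` is hypothetical, it works on `std::string` and `std::u16string` instead of `CString`, it replaces malformed bytes with U+FFFD, and it does not reject overlong encodings.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Decodes UTF-8 bytes into UTF-16 code units.
// Malformed input is replaced by U+FFFD instead of being rejected.
std::u16string Utf8ToUtf16(const std::string& utf8) {
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::uint32_t cp = 0;   // code point being assembled
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }  // stray byte

        // Every continuation byte must exist and look like 10xxxxxx.
        bool ok = (i + extra < utf8.size());
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            out.push_back(0xFFFD);  // truncated or invalid sequence
            ++i;
            continue;
        }
        i += extra + 1;

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));  // one code unit
        } else {
            cp -= 0x10000;
            assert(cp <= 0xFFFFF);  // 20 bits remain, split into two halves
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Characters outside the Basic Multilingual Plane take two UTF-16 code units (a surrogate pair), which is why the output length can differ from the number of characters.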

Add the include directory that contains all the header files to your project (covered in previous articles);
then include tidy in your stdafx.h:

#include <tchar.h>
#include <tidy.h>
#include <buffio.h>

Link against libtidy.lib (this applies to Windows; on Linux it is simpler; covered in previous articles);
then use tidy:

	CStringA ainput(input);
	TidyBuffer output = {0};
	TidyBuffer errbuf = {0};
	int rc = -1;
	Bool ok;

	TidyDoc tdoc = tidyCreate();                     // Initialize "document"

	ok = tidyOptSetBool( tdoc, TidyXhtmlOut, no );  // Keep HTML output (do not convert to XHTML)
	if ( ok )
		rc = tidySetErrorBuffer( tdoc, &errbuf );      // Capture diagnostics
	if (rc>=0) {
		rc = tidyOptSetInt(tdoc, TidyOutCharEncoding, 0);
	}
	if ( rc >= 0 )
		rc = tidyParseString( tdoc, ainput );           // Parse the input
	if ( rc >= 0 )
		rc = tidyCleanAndRepair( tdoc );               // Tidy it up!
	if ( rc >= 0 )
		rc = tidyRunDiagnostics( tdoc );               // Kvetch
	if ( rc > 1 )                                    // If error, force output.
		rc = ( tidyOptSetBool(tdoc, TidyForceOutput, yes) ? rc : -1 );
	if ( rc >= 0 )
		rc = tidySaveBuffer( tdoc, &output );          // Pretty Print

	if ( rc >= 0 )
		result=output.bp;
	else
		result=input;

	tidyBufFree( &output );
	tidyBufFree( &errbuf );
	tidyRelease( tdoc );
	return result;

Include the HTML parser headers you need, for example:

#include "AClass/LiteHTMLReader.h"
#include "AClass/HtmlElementCollection.h"

Instantiate the reader that will parse the HTML string, and attach an event handler to it:

CLiteHTMLReader theReader;
CHtmlElementCollection theElementCollectionHandler;
theReader.setEventHandler(&theElementCollectionHandler);

If you want only the tags that carry a specific attribute value, use:

theElementCollectionHandler.InitWantedTag(_T("div"), _T("id"),_T("demo1"));

Call the parser function; theElementCollectionHandler will be filled with the parsed structure.

theReader.Read(m_szHtmlPage);
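
The reader and handler above follow an event-driven pattern: the reader scans the text and reports what it finds to a handler object, which decides what to keep. Here is a minimal portable sketch of that pattern. `ITagHandler`, `MiniReader` and `WantedTagCollector` are hypothetical names invented for illustration; this is not how `CLiteHTMLReader` or `CHtmlElementCollection` are implemented, and the scanner ignores attributes, comments and quoting rules.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical event interface: the reader reports every tag it meets.
struct ITagHandler {
    virtual ~ITagHandler() = default;
    virtual void OnTag(const std::string& name, bool closing, std::size_t offset) = 0;
};

// Hypothetical reader: scans the text once and fires one event per tag.
class MiniReader {
public:
    void SetEventHandler(ITagHandler* handler) { handler_ = handler; }

    // Returns the number of tags reported to the handler.
    std::size_t Read(const std::string& html) const {
        assert(handler_ != nullptr);  // set a handler before calling Read
        std::size_t count = 0;
        for (std::size_t i = html.find('<'); i != std::string::npos;
             i = html.find('<', i + 1)) {
            std::size_t p = i + 1;
            bool closing = (p < html.size() && html[p] == '/');
            if (closing) ++p;
            std::size_t nameStart = p;
            while (p < html.size() &&
                   std::isalnum(static_cast<unsigned char>(html[p]))) {
                ++p;
            }
            if (p == nameStart) continue;  // "<!...", or a stray '<'
            handler_->OnTag(html.substr(nameStart, p - nameStart), closing, i);
            ++count;
        }
        return count;
    }

private:
    ITagHandler* handler_ = nullptr;
};

// Handler that remembers where each opening tag of one wanted name starts.
class WantedTagCollector : public ITagHandler {
public:
    explicit WantedTagCollector(std::string wanted) : wanted_(std::move(wanted)) {}

    void OnTag(const std::string& name, bool closing, std::size_t offset) override {
        if (!closing && name == wanted_) starts_.push_back(offset);
    }

    const std::vector<std::size_t>& Starts() const { return starts_; }

private:
    std::string wanted_;
    std::vector<std::size_t> starts_;
};
```

Keeping the filtering logic in the handler means the reader stays generic: a different handler can collect links, count tags, or build a tree without any change to the scanning code.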

Finally, extract the target HTML snippet:

CString szTxt((LPCTSTR)m_szHtmlPage+m_dwarrTagStart.GetAt(0),m_dwarrTagLen.GetAt(0));
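
The line above cuts the snippet out of the page using a start offset and a length. To make that idea concrete, here is a minimal portable sketch that computes such a pair for an element with a given attribute value and then extracts it. `Span`, `IsOpenTagAt` and `FindElement` are hypothetical helpers, not part of the HTML parser. The sketch assumes well-formed markup of the kind tidy produces: double-quoted attributes, a space before the attribute name, lowercase tag names and balanced closing tags.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

struct Span {
    std::size_t start = 0;  // offset of '<' in the opening tag
    std::size_t len = 0;    // length up to and including the closing tag
};

// True when the text at `pos` is "<tag" followed by whitespace, '>' or '/',
// so that "<div" does not match "<divider".
static bool IsOpenTagAt(const std::string& html, std::size_t pos,
                        const std::string& tag) {
    const std::string open = "<" + tag;
    if (html.compare(pos, open.size(), open) != 0) return false;
    std::size_t next = pos + open.size();
    if (next >= html.size()) return false;
    char c = html[next];
    return c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '>' || c == '/';
}

// Finds the first <tag ... attr="value" ...> element and reports its extent.
bool FindElement(const std::string& html, const std::string& tag,
                 const std::string& attr, const std::string& value, Span& out) {
    const std::string open = "<" + tag;
    const std::string close = "</" + tag + ">";
    const std::string wanted = " " + attr + "=\"" + value + "\"";

    for (std::size_t pos = html.find(open); pos != std::string::npos;
         pos = html.find(open, pos + 1)) {
        if (!IsOpenTagAt(html, pos, tag)) continue;
        std::size_t tagEnd = html.find('>', pos);
        if (tagEnd == std::string::npos) return false;
        // The attribute must appear inside this opening tag.
        if (html.substr(pos, tagEnd - pos).find(wanted) == std::string::npos) continue;

        // Walk forward, tracking nesting depth of same-named elements.
        int depth = 1;
        std::size_t cur = tagEnd + 1;
        while (depth > 0) {
            std::size_t nextOpen = html.find(open, cur);
            while (nextOpen != std::string::npos && !IsOpenTagAt(html, nextOpen, tag)) {
                nextOpen = html.find(open, nextOpen + 1);
            }
            std::size_t nextClose = html.find(close, cur);
            if (nextClose == std::string::npos) return false;  // unbalanced markup
            if (nextOpen != std::string::npos && nextOpen < nextClose) {
                ++depth;
                cur = nextOpen + open.size();
            } else {
                --depth;
                cur = nextClose + close.size();
            }
        }
        out.start = pos;
        out.len = cur - pos;
        return true;
    }
    return false;
}
```

With the span in hand, `html.substr(span.start, span.len)` plays the same role as the `CString` constructor call: it copies `len` characters beginning at offset `start`.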

Copyright © 2010 - C++ Technology. All Rights Reserved.