Impatient go to my next blog:
http://oracletidybinding.blogspot.com/2010/12/packaged-liboratidy.html
For some reason I needed to download web pages and store the parsed content into a relational database. I decided to use oracle as my RDBMS since oracle provides a really big set of features for this task (like jobs, pl java, xdb, utl_http and some more) and we already use oracle in our company. So downloading the web side would not be a big deal using the utl_http package inside the database. But when I have started I have quickly figured out that html is not xml and I need to “tidy” the web sites before I can parse them. Since Oracle supports Java as a built in language using jTidy seemed to me to be the easiest and fastest solution. No way! When I tried to ”loadjava” jtidy.jar I got tons of missing dependent classes. After a whole day of loading mega bytes of java code I decided that this cannot be a good solution.
While I was screening the web for a possible good solution I came across the “binding C libraries to oracle” feature called extproc. There is not much documentation on the web and I have never done anything in C but I will give it a shot.
And here we go, my first blog and it is about binding libtidy to oracle plsql.
First I screened some existing libraries in my Oracle DB and found out that xmlparser is simply the same thing. But you cannot pass structs between oracle and C, so I had to write all using simple data types. My first shot on my develop environment WinXP and Oracle 11g2 and using MinGW looked like this:
Code:
//File: libtidywrapedora.c
#include "tidy.h"
#include "buffio.h"
#include "stdio.h"
#include "errno.h"
#include "errno.h"
TidyDoc tdoc;
TidyBuffer output = {0};
TidyBuffer errbuf = {0};
void newParser()
{
tdoc = tidyCreate(); // Initialize "document"
}
int parse(const char *inXml)
{
int rc = -1;
rc = tidySetErrorBuffer( tdoc, &errbuf ); // Capture diagnostics
if ( rc >= 0 )
rc = tidyParseString( tdoc, inXml ); // Parse the input
if ( rc >= 0 )
rc = tidyCleanAndRepair( tdoc ); // Tidy it up!
if ( rc >= 0 )
rc = tidyRunDiagnostics( tdoc ); // Kvetch
if ( rc > 1 ) // If error, force output.
rc = ( tidyOptSetBool(tdoc, TidyForceOutput, yes) ? rc : -1 );
if ( rc >= 0 )
rc = tidySaveBuffer( tdoc, &output ); // Pretty Print
return rc;
};
void setOpt(int tidyOption, int yesno)
{
Bool ok;
ok = tidyOptSetBool( tdoc, tidyOption, yesno ); // Convert to XHTML
};
char* getXML()
{
return output.bp;
}
char* getError()
{
return errbuf.bp;
}
void closeParser()
{
tidyBufFree( &output );
tidyBufFree( &errbuf );
tidyRelease( tdoc );
}
Compile the file with
gcc libtidywrapedora.c tidy.dll -shared -o libtidywrapedora.dll
Then copy the resulting .dll File to your ORCALE_HOME/BIN folder. Since Windwos does not use enviroment variables for oracle home path, you can simply select the path from within sqlplus using this tips.
Here is a select getting the path you have to copy your library file to:
select
case
when (SELECT lower(platform_name) from v$database) like '%win%' then
substr(file_spec, 1, instr(file_spec, '\', -1, 1) -1) || '\' /* windows */
else
substr(file_spec, 1, instr(file_spec, '/', -1, 1) -1) || '/' /*s unix */
end ORACLE_LIB_HOME
from dba_libraries
where library_name = 'DBMS_SUMADV_LIB';
In my case the result was:
ORACLE_LIB_HOME
------------------------------------------------------
C:\app\XPMUser\product\11.2.0\dbhome_1\bin\
SQL>
Next I have created the lib using CREATE LIBRARY TIDY_LIB_TEST IS ‘C:\app\XPMUser\product\11.2.0\dbhome_1\BIN\libtidywrapedora.dll’ and then did a simple test if I can bind the lib to plsql:
set serveroutput on
declare
parserID pls_integer;
rc number;
ixml clob := '<title>Foo</title><p>Foo!';
oxml clob := ' ';
exml varchar2(4000) := ' ';
procedure setopt(option_id pls_integer, yesno_id pls_integer) IS
EXTERNAL
NAME "setOpt"
LANGUAGE C
LIBRARY TIDY_LIB_TEST
PARAMETERS ( option_id int, yesno_id int);
procedure closeparser IS
EXTERNAL
NAME "closeParser"
LANGUAGE C
LIBRARY TIDY_LIB_TEST;
procedure newparser IS
EXTERNAL
NAME "newParser"
LANGUAGE C
LIBRARY TIDY_LIB_TEST;
FUNCTION parse (i_xml varchar2) RETURN pls_integer IS
EXTERNAL
NAME "parse"
LANGUAGE C
LIBRARY TIDY_LIB_TEST
PARAMETERS (i_xml
,RETURN int );
FUNCTION getXML RETURN varchar2 IS
EXTERNAL
NAME "getXML"
LANGUAGE C
LIBRARY TIDY_LIB_TEST
PARAMETERS ( RETURN );
FUNCTION getError RETURN varchar2 IS
EXTERNAL
NAME "getError"
LANGUAGE C
LIBRARY TIDY_LIB_TEST
PARAMETERS ( RETURN );
begin
newParser;
setOpt(23,1);
rc := parse(ixml);
dbms_output.put_line('rc: ' || rc);
oxml := getXML;
dbms_output.put_line('output: ' || oxml);
if rc > 0 then
dbms_output.put_line('error: ' || getError);
end if;
closeParser;
end;
/
And here you see the result:
rc: 1
output: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html
xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator"
content=
"HTML Tidy for Windows (vers 14 February 2006), see www.w3.org"
/>
<title>Foo</title>
</head>
<body>
<p>Foo!</p>
</body>
</html>
error: line 1 column 1 - Warning: missing <!DOCTYPE> declaration
Info: Document
content looks like XHTML 1.0 Strict
1 warning, 0 errors were found!
PL/SQL-Prozedur erfolgreich abgeschlossen.
SQL>
Just a few hours of work and it works much more comfortable than the jtidy solution. But there is stil one thing left. Since varchar2 is limited to 4000 characters I rewrote the code for usage with clobs and enabled all possible tidy parameters. Finally I submitted the whole code to tidy.sourceforge.net. And here you can download the code -> Link. The usage is as simple as select tidy.parse(i_clob,'out_xml=yes,input_xml=yes) from dual;
At the end, let me say: It is a pity oracle do not provide something like loadjava for external procedures maybe something called loadc. And do not provide a plsql wrapping tool using introspection, for example like gnome does with http://live.gnome.org/GObjectIntrospection .
Christian