Saturday, December 4, 2010

Packaged liboratidy

I have packaged all my sources and made a unix Makefile. Now you can use htmltidy form your oracle by simply: download, ./configure, make, sudo -E make install.

Check out the package in the liboratidy schema. Have fun :-)

Goto: https://sourceforge.net/projects/liboratidy/files/

SQL> set serveroutput on
SQL> exec tidy.test;
Return Code=1
xml=<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"

"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html
xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator"
content=
"HTML Tidy for Linux/x86 (vers 25 March 2009), see www.w3.org"
/>
<title>Foo</title>
</head>
<body>
<p>Foo!&Ntilde;d&para;</p>
</body>
</html>

warnings/errors=line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 1
column 19 - Warning: inserting implicit <body>
Info: Document content looks like
XHTML 1.0 Strict
2 warnings, 0 errors were found!



PL/SQL procedure successfully completed.

SQL>

Cheers
Christian

Wednesday, September 15, 2010

Binding libtidy.so to PL/SQL

Impatient go to my next blog: http://oracletidybinding.blogspot.com/2010/12/packaged-liboratidy.html







For some reason I needed to download web pages and store the parsed content into a relational database. I decided to use oracle as my RDBMS since oracle provides a really big set of features for this task (like jobs, pl java, xdb, utl_http and some more) and we already use oracle in our company. So downloading the web side would not be a big deal using the utl_http package inside the database. But when I have started I have quickly figured out that html is not xml and I need to “tidy” the web sites before I can parse them. Since Oracle supports Java as a built in language using jTidy seemed to me to be the easiest and fastest solution. No way! When I tried to ”loadjava” jtidy.jar I got tons of missing dependent classes. After a whole day of loading mega bytes of java code I decided that this cannot be a good solution.

While I was screening the web for a possible good solution I came across the “binding C libraries to oracle” feature called extproc. There is not much documentation on the web and I have never done anything in C but I will give it a shot.  

And here we go, my first blog and it is about binding libtidy to oracle plsql.

First I screened some existing libraries in my Oracle DB and found out that xmlparser is simply the same thing. But you cannot pass structs between oracle and C, so I had to write all using simple data types. My first shot on my develop environment WinXP and Oracle 11g2 and using MinGW looked like this:

Code:





 //File: libtidywrapedora.c  
 #include "tidy.h"  
 #include "buffio.h"  
 #include "stdio.h"  
 #include "errno.h"  
 #include "errno.h"  
   
 TidyDoc tdoc;  
 TidyBuffer output = {0};  
 TidyBuffer errbuf = {0};  
   
 void newParser()  
 {  
  tdoc = tidyCreate();           // Initialize "document"  
 }  
   
 int parse(const char *inXml)  
 {  
  int rc = -1;  
   
  rc = tidySetErrorBuffer( tdoc, &errbuf );    // Capture diagnostics  
  if ( rc >= 0 )  
   rc = tidyParseString( tdoc, inXml );      // Parse the input  
  if ( rc >= 0 )  
   rc = tidyCleanAndRepair( tdoc );        // Tidy it up!  
  if ( rc >= 0 )  
   rc = tidyRunDiagnostics( tdoc );        // Kvetch  
  if ( rc > 1 )                       // If error, force output.  
   rc = ( tidyOptSetBool(tdoc, TidyForceOutput, yes) ? rc : -1 );  
  if ( rc >= 0 )  
   rc = tidySaveBuffer( tdoc, &output );     // Pretty Print  
   
  return rc;  
 };  
   
 void setOpt(int tidyOption, int yesno)  
 {  
  Bool ok;  
  ok = tidyOptSetBool( tdoc, tidyOption, yesno ); // Convert to XHTML  
 };   
   
 char* getXML()  
 {  
  return output.bp;  
 }  
   
 char* getError()  
 {  
  return errbuf.bp;  
 }  
   
 void closeParser()  
 {  
  tidyBufFree( &output );  
  tidyBufFree( &errbuf );  
  tidyRelease( tdoc );  
 }  
   


Compile the file with




  gcc libtidywrapedora.c tidy.dll -shared -o libtidywrapedora.dll


Then copy the resulting .dll File to your ORCALE_HOME/BIN folder. Since Windwos does not use enviroment variables for oracle home path, you can simply select the path from within sqlplus using this tips.

Here is a select getting the path you have to copy your library file to:






   
 select   
  case   
   when (SELECT lower(platform_name) from v$database) like '%win%' then   
    substr(file_spec, 1, instr(file_spec, '\', -1, 1) -1) || '\' /* windows */  
   else   
    substr(file_spec, 1, instr(file_spec, '/', -1, 1) -1) || '/' /*s unix */  
  end ORACLE_LIB_HOME  
   from dba_libraries  
   where library_name = 'DBMS_SUMADV_LIB';  
   

In my case the result was:

ORACLE_LIB_HOME
------------------------------------------------------

C:\app\XPMUser\product\11.2.0\dbhome_1\bin\

SQL>

Next I have created the lib using CREATE LIBRARY TIDY_LIB_TEST IS ‘C:\app\XPMUser\product\11.2.0\dbhome_1\BIN\libtidywrapedora.dll’ and then did a simple test if I can bind the lib to plsql:




   
 set serveroutput on  
   
 declare   
  parserID pls_integer;  
  rc    number;  
  ixml   clob := '<title>Foo</title><p>Foo!';  
  oxml   clob := ' ';  
  exml   varchar2(4000) := ' ';  
   
   
  procedure setopt(option_id pls_integer, yesno_id pls_integer) IS   
    EXTERNAL   
    NAME "setOpt"   
    LANGUAGE C   
    LIBRARY TIDY_LIB_TEST  
    PARAMETERS ( option_id int, yesno_id int);   
   
  procedure closeparser IS   
    EXTERNAL   
    NAME "closeParser"   
    LANGUAGE C   
    LIBRARY TIDY_LIB_TEST;  
   
  procedure newparser IS   
    EXTERNAL   
    NAME "newParser"   
    LANGUAGE C   
    LIBRARY TIDY_LIB_TEST;  
   
  FUNCTION parse (i_xml varchar2) RETURN pls_integer IS   
    EXTERNAL   
    NAME "parse"   
    LANGUAGE C   
    LIBRARY TIDY_LIB_TEST  
    PARAMETERS (i_xml   
          ,RETURN int );  
   
  FUNCTION getXML RETURN varchar2 IS   
    EXTERNAL   
    NAME "getXML"   
    LANGUAGE C   
    LIBRARY TIDY_LIB_TEST  
    PARAMETERS ( RETURN );  
   
  FUNCTION getError RETURN varchar2 IS   
    EXTERNAL   
    NAME "getError"   
    LANGUAGE C   
    LIBRARY TIDY_LIB_TEST  
    PARAMETERS ( RETURN );  
   
 begin   
  newParser;  
  setOpt(23,1);  
  rc := parse(ixml);  
    
  dbms_output.put_line('rc: ' || rc);   
  oxml := getXML;  
  dbms_output.put_line('output: ' || oxml);   
   
  if rc > 0 then  
   dbms_output.put_line('error: ' || getError);     
  end if;  
   
  closeParser;  
 end;  
 /   
  

And here you see the result:
rc: 1
output: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"

"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html
xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator"
content=
"HTML Tidy for Windows (vers 14 February 2006), see www.w3.org"
/>
<title>Foo</title>
</head>
<body>
<p>Foo!</p>
</body>
</html>

error: line 1 column 1 - Warning: missing <!DOCTYPE> declaration
Info: Document
content looks like XHTML 1.0 Strict
1 warning, 0 errors were found!



PL/SQL-Prozedur erfolgreich abgeschlossen.

SQL>

Just a few hours of work and it works much more comfortable than the jtidy solution. But there is stil one thing left. Since varchar2 is limited to 4000 characters I rewrote the code for usage with clobs and enabled all possible tidy parameters. Finally I submitted the whole code to tidy.sourceforge.net. And here you can download the code -> Link. The usage is as simple as select tidy.parse(i_clob,'out_xml=yes,input_xml=yes) from dual;


At the end, let me say: It is a pity oracle do not provide something like loadjava for external procedures maybe something called loadc. And do not provide a plsql wrapping tool using introspection, for example like gnome does with http://live.gnome.org/GObjectIntrospection .

Christian