Wednesday, April 23, 2008

OpenXML() Rocks!

But XML, maybe not so much...

XML files are sometimes used to transmit data between relational databases using a hierarchical structure that is non-relational in nature. SQL Anywhere provides very fast and fairly-easy-to-use facilities to parse and load this XML data into relational tables: xp_read_file() and INSERT t SELECT ... FROM OPENXML().

Here's a story some of you will be painfully familiar with: Company A is sending data to Company B, from one computer system to another. Company A has the power to define what the data transmission looks like (no, it's not Walmart, but it could be), and it chooses to switch to XML.

Company A used to let Company B get the data in a synchronous fashion by running SQL queries against Company A's database. There's no real reason to stop doing that, especially since Company B is a wholly-owned subsidiary of Company A. Nevertheless, it's a new world, everyone's learning about XML in school, and everyone wants to use it as soon as they graduate.

So now, instead of nine queries to copy about 31M of database data from nine SQL Server tables, a single 84M XML file is created by Company A, transmitted across the internet and then processed by Company B. This file contains all the data in a deeply nested hierarchical structure under one single top-level XML tag.

... and that's reminiscent of IBM's mainframe "Information Management System" hierarchical DBMS from the 1970s.
Once upon a time, about 30 or 40 years ago, there were three kinds of database managment systems competing for popularity: Mind-numbingly complex "network" DBMS products, not-quite-so-bad "hierarchical" products, and the brand-new, wonderfully-simple (and eventually victorious) table or "relational" databases. With network and hierarchical databases, programmers wrote lots of code that simply navigated from node to node, with instructions like "Get Unique", "Get Next", "Get Next Within Parent" to traverse the tree, and "Get Hold Next Within Parent" to signal update intent. Similarly, with hierarchical representations of data in XML, the notion of navigation becomes important once again... except today it's given a patina of modernity with terms like "parsing" and "shredding".
But enough ranting... The engineers behind SQL Anywhere have given us very fast and fairly-easy-to-use facilities to parse and load XML data into relational tables, which you can code inside a stored procedure:
  • Step 1: Call xp_read_file() to load the XML data into a local LONG VARCHAR string variable. That's fast and easy.

  • Step 2: Execute INSERT t SELECT ... FROM OPENXML() to parse the XML and load the table. Also fast, but easy? We'll see...
Step 1 is currently limited to 2G for the string value. For larger files, say, the entire contents of Wikipedia, you're going to have to do something different (don't laugh, you *can* download all of Wikipedia in a single XML file, Jason Hinsperger's done it, and loaded it into a SQL Anywhere database, and I hope he'll talk about it here.)

Step 2 is executed once for each table. For this particular story, that means parsing the XML string nine times, with nine different INSERT t SELECT ... FROM OPENXML() statements. Sounds wasteful, but it's fast, so it falls into the "who cares?" performance category.

The Input

Here's a snippet from the giant XML file:



For this discussion, we're looking for the XML <Line...> element at the bottom of the picture.

The Output

The data from each <Line...> element (and there are thousands of them in the full file) is going to be inserted into a row in the RDN_Line table:
CREATE TABLE RDN_Line (   
artificial_pkey BIGINT NOT NULL DEFAULT AUTOINCREMENT
PRIMARY KEY,
InternalId VARCHAR ( 15 ) NOT NULL,
Line VARCHAR ( 50 ) NOT NULL,
RDNFlag VARCHAR ( 5 ) NOT NULL,
RDNUpdated VARCHAR ( 100 ) NOT NULL );
  • RDNLine.Line = the data value between the <Line...> and </Line> tags; e.g., "Linguistics".

  • RDNLine.RDNUpdated = the RDNUpdated attribute from the <Line...> tag; e.g., "2008-03-18T22:29:07.0000000-05:00".

  • RDNLine.RDNFlag = the RDNFlag attribute from the <Line...> tag; e.g., "true".

  • RDNLine.InternalId = the data value between the <InternalId...> and </InternalId> tags; e.g., "2195445AG".
That last one is really tricky: just where is the <InternalId...> tag? This is what makes the XML hierarchical, like good old IMS, and not relational, like databases in modern times... you have to navigate from the <Line...> tag, up past <Lines> to the parent <Agency...>, and then back down past <InternalIds> to the <InternalId...> tag. That's where you'll find the value to put in RDNLine.InternalId.
And that's what makes this stuff complex, not the SQL syntax, but the way people create XML. It's a license to obfuscate, and the best (worst) part is this: those same people claim XML is simple, straightforward and self-documenting... sure it is, and the emperor is fully clothed.

Oh, and what about the artificial_pkey column? Logically the primary key to RDN_Line is ( InternalId, Line ) but because the XML is stored in an ordinary text file there's no guarantee it's valid. The code discussed here simply loads the XML data into a table for further processing which includes edits for validity that aren't shown here. The key word is "robust": no runtime exceptions, no matter what kind of garbage is received.

The Processing

Here's the INSERT statement that parses each <Line...> element and puts the data into RDN_line:
INSERT RDN_Line (
InternalId,
Line,
RDNFlag,
RDNUpdated )
SELECT TRIM ( COALESCE ( InternalId, '???' ) ),
TRIM ( COALESCE ( Line, '???' ) ),
TRIM ( COALESCE ( RDNFlag, '???' ) ),
TRIM ( COALESCE ( RDNUpdated, '???' ) )
FROM OPENXML (
@xml,
'/Branches/Agencies/Agency/Lines' )
WITH ( InternalId VARCHAR ( 15 ) '../InternalIds/InternalId',
Line VARCHAR ( 50 ) 'Line',
RDNFlag VARCHAR ( 5 ) './Line/@RDNFlag',
RDNUpdated VARCHAR ( 100 ) './Line/@RDNUpdated' );

Yes, it looks ugly, and yes, it *is* ugly... so why did I call it "fairly-easy-to-use"?

Because it's short, for one thing. Ten or twenty lines of complex code are usually easier to deal with than a hundred or a thousand lines of complex code.

And the processing logic is divided into separate components, each of which can be understood once someone explains it to you. Which is what I'm going to try do do: divide and conquer.

First, a high-level view, reading from the bottom up:
  • The WITH clause at the bottom tells the builtin OPENXML() procedure how to parse the XML data to find the column values you're looking for.

  • The two arguments passed to OPENXML() do more of the same thing: they tell OPENXML what to do.

  • The FROM clause makes use of SQL Anywhere's ability to treat a procedure call just like a table. If a procedure returns a result set (and OPENXML does that), you can call it in a FROM clause.

  • The SELECT list tells SQL Anywhere what to do with each column in the result set returned by OPENXML().

  • The INSERT list tells SQL Anywhere where to put columns from the SELECT list. And (finally :) it's done!
No, it's still not easy, not even "fairly easy".

The Syntax

Here is the pseudo-SQL for Steps 1 and 2 described earlier, the call to xp_read_file() and the INSERT t SELECT ... FROM OPENXML():
BEGIN
DECLARE @xml LONG VARCHAR;

SET @xml = xp_read_file ( 'filespec-for-XML-data' );

SET TEMPORARY OPTION STRING_RTRUNCATION = 'OFF';

INSERT target-table (
target-column-name-1,
target-column-name-2,
... )
SELECT column-name-1,
column-name-2,
...
FROM OPENXML ( @xml,
'Xpath-query-to-find-starting-point' )
WITH ( column-name-1
data-type-1
'Xpath-subquery-to-get-value-1',
column-name-2
data-type-2
'Xpath-subquery-to-get-value-2',
... );
COMMIT;
END;
  • The SET @xml = xp_read_file loads the entire contents of the XML file into a local LONG VARCHAR variable.

  • Setting STRING_RTRUNCATION = 'OFF' is done so the OPENXML call will truncate XML data values that are wayyyyy too long, rather than raising an exception; more on this later.

  • The target-column-names in the INSERT list point to the target-table. They don't have to be the same as the column-names used in the SELECT; the correspondence is positional.

  • The column-names in the SELECT correspond by name to the column-names in the WITH clause.

  • The first argument passed to OPENXML is the @xml string filled by the earlier call to xp_read_file().

  • The column-names in the WITH clause are made up by you, and only have local scope (this one SQL statement). Personally, I like to use exactly the same names here as the names that appear in the corresponding XML tags. For example, if <node>value</node> appears in the XML, "node" is what I use as the corresponding column name in the WITH clause.
    The WITH keyword may be the most popular keyword with the designers of SQL Anywhere... try looking it up in the Help and you'll be hard pressed to find what you're looking for, there are so many. In fact, you can have several different styles of WITH clauses one single SELECT: WITH index hints, OPENSTRING WITH, WITH RECURSIVE, WITH local view. The syntax and semantics of any particular WITH clause depends on exactly where it's used, and *this* particular WITH clause applies only to the OPENXML call.

  • The data-types in the WITH clause are chosen by you, to tell the OPENXML call how to convert the XML string data values when it finds them.

    If you are expecting garbage values in the XML the only truly safe data type is LONG VARCHAR which will accept pretty much anything. But that's extreme; personally, I prefer VARCHAR with some reasonable maximum length, and turn STRING_RTRUNCATION = 'OFF' to chop off unreasonably long values. You can still perform edit checks for the "too long" condition by specifying a length that is longer that what you really want, but there is no point storing the entire contents of an infinitely long string (where "infinite" means "way too long").

The Xpath Zone

So what's left? The two kinds of 'Xpath-...' strings, that's what... by far the weirdest parts of this whole INSERT t SELECT ... FROM OPENXML() statement.

The 'Xpath-...' strings take us out of the SQL language and into something called the "Xpath Query" language... you can read about that language here, but file a flight plan first; i.e., let your next-of-kin know where you're going in case you need rescuing later.

Once again, here's the actual code for the INSERT t SELECT ... FROM OPENXML():
INSERT RDN_Line (
InternalId,
Line,
RDNFlag,
RDNUpdated )
SELECT TRIM ( COALESCE ( InternalId, '???' ) ),
TRIM ( COALESCE ( Line, '???' ) ),
TRIM ( COALESCE ( RDNFlag, '???' ) ),
TRIM ( COALESCE ( RDNUpdated, '???' ) )
FROM OPENXML (
@xml,
'/Branches/Agencies/Agency/Lines' )
WITH ( InternalId VARCHAR ( 15 ) '../InternalIds/InternalId',
Line VARCHAR ( 50 ) 'Line',
RDNFlag VARCHAR ( 5 ) './Line/@RDNFlag',
RDNUpdated VARCHAR ( 100 ) './Line/@RDNUpdated' );

The 'Xpath-query-to-find-starting-point' string is '/Branches/Agencies/Agency/Lines', and it means this: Start at the root "/" of the XML document, and navigate down the tree to the <Lines> tag (that's Lines plural, not Line singular), and start the parsing there. Not just the first <Lines> tag, but each and every one of them: find them all, parse them all.

The first 'Xpath-subquery-to-get-value-...' is '../InternalIds/InternalId' which means this: Go up one level "../" from the <Lines> tag to the <Agency...> tag, then down two levels to the <InternalId...> tag, and pick up the data value "2195445AG" from between the <InternalId...> and </InternalId> tags.
If you're wondering why we need this InternalId value, it's because it is the logical primary key for all the data under the <Agency...> tag. In the XML it only appears once, in the Agency element, but on the relational database side of things it must be repeated as a logical foreign key column in each of the child tables (RDN_Line.InternalId in this case). So that's why all the navigating, up-and-down... air-sickness bags optional but recommended.

The second Xpath subquery is simply 'Line' which means this: Pick up the data value "Linguistics" from between the <Line...> and </Line> tags.

The third Xpath subquery './Line/@RDNFlag' means this: Look inside the <Line...> tag itself, and pick up the value for the RDNFlag attribute, which is "true". That's what the "@" means: look for an attribute value, not an element data value.
As far as I know the "./" directive means "start here, where you are, and look down". That's rather redundant, and I'm not sure it's necessary... but it works, and I have less-than-zero interest in exploring alternative Xpath queries to do the same thing :)

The last Xpath subquery './Line/@RDNUpdated' picks up the value for the RDNUpdated attribute, which is "2008-03-18T22:29:07.0000000-05:00", a kind of funky committee-designed timestamp-with-timezone format.

The Results

Here's a look at that one row in RDN_Line...



Here are the numbers for a full test on a laptop...
  • 84M XML file containing data for 9 tables in a deeply nested tag hierarchy

  • Adaptive Server Anywhere Database Engine Version 9.0.2.3575

  • HP Pavilion zd8000 laptop 3.4Ghz Pentium 4 running XP SP2

  • Method: xp_read_file into a LONG VARCHAR variable then parse 9 times via INSERT t SELECT ... FROM OPENXML()

  • 9 tables, 71 columns, 338,917 rows

  • 36.5M total database space = 30.7M table + 5.8M index

  • 2.5 minutes on a cold cache

13 comments:

Anonymous said...

OpenXML() rocks and you rock too!!!
Thanks very much for responding to my query at sybase.public.sqlanywhere (Ref: Pushing data from XML to Sybase tables). I read your comment and your entire blog post also. It is so cool. You have given so much details at every level, it can not get better. It works and I really liked it (more so because I am new to this area).

However I still have not figured out if it _has_to_be a procedure or can it be a sql query also. I am using Interactive SQL (SQL Anywhere 10) and it requires stuff like BEGIN, END etc. Please reply (at your blog) if there is a way it can be done in a SQL statement also.

Thanks very much once again.
Keep writing good stuff.

~ Abhishek

PS: My real name is Abhishek. Somehow the NNTP client I used to post the previous message messed up my name and email id and showed some one else's. :-)

Breck Carter said...

Your wish is my command! :)

http://sqlanywhere.blogspot.com/2008/05/executing-begin-blocks.html

Anonymous said...

Hi Breck,
I am having a trouble with OpenXML() and thought of discussing with you. I am trying to insert a big (but simple) XML into a single Sybase table... there are 37804 records which will map to as many rows in the target table. But when I execute this query, it returns after a couple of minutes only to display an error "Insufficient memory for OPENXML query (-912)" :-(

I looked at the error description at http://www.ianywhere.com/developer/product_manuals/sqlanywhere/0901/en/html/dberen9/00000273.htm but not to much avail.

Can you provide some pointers? From your original post, it looked like OpenXML() will be able to handle large amount of data without any problems but looks like it is not.

Anonymous said...

The link I mentioned got messed up in the comment. Here is what I looked at:
http://www.ianywhere.com/developer/
product_manuals/sqlanywhere/0901/en/
html/dberen9/00000273.htm

Breck Carter said...

You should post a question like this on the general SQL Anywhere newsgroup; see the "how to" at http://www.sybase.com
/detail?id=1002557

(hmmmm, I'm starting to dislike blogger... :)

Anyway, I have not seen that error. The description suggests you increase the database cache, and if you are using a busy server with lots of connections your usage of the RAM cache may be limited. Send me an email with more details, and we'll just post the solution here:
breck dot carter at gmail dot com

Insufficient memory for OPENXML query
SQLCODE –912

The memory used by an OPENXML query may be proportional to the size of the largest subtree in the XML document that has nodes referred to by a single row in the result. Try increasing the cache size or rewriting the query so that rows refer to smaller subtrees.

Anonymous said...

Here's a link to an article on the newsgroup that explains this issue:

http://forums.sybase.com/cgi-bin/webnews.cgi?cmd=item-120839&group=sybase%2Epublic%2Esqlanywhere%2Egeneral

Abhishek, I'm willing to give detailed help with finding a workaround if you wish.

Breck Carter said...

Here's the full link...

http://forums.sybase.com/cgi-bin/webnews.cgi
?cmd=item-120839&group=sybase
%2Epublic%2Esqlanywhere%2Egeneral

Anonymous said...

Hi Brent,
I recently posted more details about this problem on the Sybase forum, in reply to your post there. Please see if it gives you more idea on the same...

Here is the link:

http://groups.google.com/group/
sybase.public.sqlanywhere.general/
browse_thread/thread/
cc231fbbff7b4689/7c57c09e935ee9c1

Anonymous said...

Apologies for referring you as Brent instead of Breck in the previous post and also on the newsgroup message.

Gio said...

Is there any easy way to do the oposite operation, i.e. dump data from a SA10 table as XML?

Breck Carter said...

@Gio: There's Interactive SQL OUTPUT statement and the ADO.NET DataSet object, then there's UNLOAD SELECT ... FOR XML AUTO, FOR XML EXPLICIT, FOR XML RAW, XMLAGG, XMLCONCAT, XMLELEMENT, XMLFOREST, XMLGEN, not to mention XML web services... check out the SQL Anywhere 11 Help "Using XML in the database" dcx.sybase.com/1101en/dbusage_en11/ug-sqlxml.html (which seems to be down at this exact moment)

Breck Carter said...

@Gio: Ooops, you did say SA10, didn't you? Pretty much the same, this stuff's been around for a long time, but I'm not sure if the docs are available online at dcx.sybase.com, you may have to use the compiled HTML Help that comes with the download.

Anonymous said...

Hi,

I have a simple XML. When I parse that xml with OPENXML , I am getting NULL values.Can some one help me if I'm doing wrong.

XML
====================================



Rikke

Norberg


xxx@gmail.com





User




User_2








Stored Procedure for Parsing
====================================
create procedure uspparsexml_2

(

@strXML VARCHAR(MAX)

)

AS

BEGIN

SET NOCOUNT ON



DECLARE @docHandle INT

EXEC sp_xml_preparedocument @docHandle OUTPUT, @strXML

declare @FirstName varchar(50)

SELECT @FirstName=FirstName



FROM OPENXML (@docHandle, 'UserRole/FirstName')



WITH (FirstName VARCHAR(50) 'FirstName')

print @FirstName

select @FirstName

EXEC sp_xml_removedocument @docHandle

print 'last'

print @FirstName

select @FirstName

print 'last'



END