Loading XML into Oracle’s Autonomous Data Warehouse
I’ve been reading a lot of blogs and searching to determine how to load objects from Oracle Cloud Infrastructure’s Object Storage, Amazon S3, Microsoft Azure Blob Storage and so on into the Autonomous Data Warehouse specifically for XML. There are different options posted and blogs like the links below,
https://www.ateam-oracle.com/using-object-storage-as-a-file-server-for-autonomous-database
I tried some of these out but hit various limitations (one was ORA-29913: error in executing ODCIEXTTABLEFETCH callout ORA-29400: data cartridge error KUP-04020: found record longer than buffer size supported, 10002432), the best approach so far that I have found for loading XML specifically is to use good old fashioned PLSQL BLOB apis and then insert the data into a Binary XML table. The links above have restrictions in terms of scale and XML support. Using the BLOB and Binary XML route avoided those limitations for my case. There are best practices on the types of tables also whether Binary XML table or a shredded relational one is use depends on the use case — see the XMLDB documentation for more;
The XML I was trying to load was the Protein Sequence Database from the link below, this file is ~700Mb;
http://aiweb.cs.washington.edu/research/projects/xmltk/xmldata/www/repository.html
Here’s the general approach I’ve taken;
- XML data is landed/copied in to OCI Object Storage
- Copy the Object Storage data to a Binary XML table for this I used the DBMS_CLOUD.GET_OBJECT procedure and an INSERT SQL statement into the binary XML table. This DBMS_CLOUD.GET_OBJECT procedure supports many different URIs including Oracle Cloud Infrastructure, Microsoft Axure, Amazon S3, see the docs here; https://docs.oracle.com/en/cloud/paas/autonomous-database/adbsa/file-uri-formats.html#GUID-5D3E1614-ADF2-4DB5-B2B2-D5613F10E4FA
- Create views on XML using XMLTable — I manually designed them but these could be generated or custom built depending on the use case and what you are trying to do
If when you test your views you get errors such as ‘ORA-29913: error in executing ODCIEXTTABLEFETCH callout ORA-64498: FTP and HTTP access over XDB repository is not allowed on server side’ then check if there are DTD references inside the XML that reference external HTTP sites if so you may need to remove/resolve.
Pretty straightforward, first let’s create the XMLType table storing the XML as binary XML…
CREATE TABLE STAGE_XML (xml_col XMLTYPE) XMLTYPE xml_col
STORE AS SECUREFILE BINARY XML;
Then you can get the object storage data as a BLOB and insert into an XMLType table….
-- Insert XML into Binary XML Data Table - the Oracle database will store as binary XML and eliminate reprocessing of the textdeclare data BLOB;begin data := DBMS_CLOUD.GET_OBJECT( credential_name =>'YOUR_CREDENTIAL', object_uri =>
'https://objectstorage..........oraclecloud.com/n/yournamespace/b/yourbucket/o/yourxml.xml'); execute immediate 'INSERT into STAGE_XML select XMLType(:blobData,873) from dual' USING data;end;
In above I used character set id with value 873, I got this value from the query “SELECT NLS_CHARSET_ID(‘AL32UTF8’) FROM DUAL;”. I used AL32UTF8 as there may be characters from UTF8 character set in data.
You can then query the XML, below I am retrieving all the ProteinEntry ids from the XML file;
— Query / create views on data - this would be similar to reading PersonDetail etc from HCM extractSELECT proteins.* FROM STAGE_XML, XMLTABLE('/ProteinDatabase/ProteinEntry' PASSING STAGE_XML.xml_col COLUMNS "id" varchar2(30) PATH '@id') proteins;
There’s tonnes of examples of using XMLTable for extracting information, its easy to extract from nested collections and much much more, here’s an example reading the protein information and then also getting all the authors;
SELECT proteins.*, authors.* FROM HOLDS_XML, XMLTABLE('/ProteinDatabase/ProteinEntry' PASSING STAGE_XML.xml_col COLUMNS "id" varchar2(30) PATH '@id', "name" varchar2(30) PATH 'protein/name', "organism_source" varchar2(30) PATH 'organism/source', "organism_common" varchar2(30) PATH 'organism/common', "authors" XMLTYPE PATH 'reference/refinfo/authors' ) proteins, XMLTABLE('authors/author' PASSING proteins."authors" COLUMNS "author" varchar2(30) PATH '.' ) authors
That was pretty straightforward and the performance of the queries on the XML were great.
The above was illustrated using the Protein Sequence Database, there’s other applications like Oracle HCM that extract into XML (see the Oracle HCM Extract link here), this approach can also be used there and the SQL views used in Oracle Cloud Infrastructure’s Data Integration dataflows and loaded into your data warehouse targets…
Hope this is useful! Interested to hear opinions and what others have done.