Splitting huge XML files into smaller ones


Ever had a huge XML file you could not open with your normal text editor?

I had an XML file of around 1 gigabyte. I had to read it into a database, but I found out that our version of Oracle's XML functions had a limit of only 65536 subelements in an element. My file had about 30 times that many elements... I had to find a solution quickly. (The error is ORA-30936: Maximum number of XML node elements exceeded.)

The file was basically "rows" of individual XML objects, and these individual "rows" were not huge in size. It was the number of rows that was huge. So if I could split these rows into multiple files and read the files one by one, the problem would be solved.

I was not allowed to download any software to split my XML file and could not find any script on the Internet to do the job. So I had to write my own script.

The parameters you need are the name of the file, how many "rows" you want in one split file, and the name of the XML tag that marks the "row" element.

Example file Data.xml:
<root>
       <text>This is pre tag text</text>
<root2>
 <element>Row1</element>
 <element>Row2</element>
 <element>Row3</element>
 <element>Row4</element>
 <element>Row5</element>
 <element>Row6</element>
</root2>
        <text>This is post tag text</text>

</root>

When you start the script, it first searches for the "posttag", that is, the end of the file after the last </element> tag. It writes this "posttag" to a *.PostTag.xml file.

Then it searches for the "pretag", that is, the start of the file before the first <element> tag. It saves this to a *.PreTag.xml file.

Then it starts reading the "rows", which in this example are marked with the <element> tag. It creates the first split file, appends the PreTag file to it, then the number of "rows" you wanted in one file, and finally appends the PostTag file. It repeats this loop until all rows are processed.
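The loop described above can be sketched in a few lines of PowerShell. This is only a simplified illustration, not the actual XMLSplit.ps1: the variable names are made up, it assumes each "row" ends on its own line and that the PreTag and PostTag files have already been written, and it ignores the case where the last file gets fewer rows than requested.

```powershell
# Simplified sketch of the splitting loop (illustrative, not the real XMLSplit.ps1)
$source   = "Data.xml"
$per_file = 1
$row_tag  = "element"

$file_no = 0
$rows    = 0
$out     = $null

Get-Content $source | ForEach-Object {
    # Only lines that close a "row" are copied into the split files
    if ($_.Contains("</$row_tag>")) {
        if ($rows -eq 0) {
            # Start a new split file with the pretag part
            $out = "{0}.{1:D4}.xml" -f $source, $file_no
            Get-Content "$source.PreTag.xml" | Set-Content $out
        }
        Add-Content $out $_
        $rows++
        if ($rows -eq $per_file) {
            # Finish the split file with the posttag part
            Get-Content "$source.PostTag.xml" | Add-Content $out
            $file_no++
            $rows = 0
        }
    }
}
```

Streaming the file line by line like this is also what made the first version slow; the real script had to be optimized beyond this.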

I wrote the script in PowerShell, and the command parameters are:
powershell -file XMLSplit.ps1 FileToBeSplit.xml NumberOfRowsInOneFile RowTag

For the example file the command might be:

powershell -file XMLSplit.ps1 Data.xml 1 element

The first split file Data.xml.0000.xml would look like this:
<root>
       <text>This is pre tag text</text>
<root2>
 <element>Row1</element>
</root2>        <text>This is post tag text</text>
</root>

Creating the posttag file takes almost half of the total execution time. I don't use any "tail" command to get the last lines but have to use what is available in PowerShell 2.0. If you have PowerShell 3.0, you can try the new Tail parameter of Get-Content. Reading a 2 GB file takes about 3 hours, and in my case the execution time wasn't critical any more.
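With PowerShell 3.0, for example, the last lines of the example file can be read directly without streaming through the whole file:

```powershell
# PowerShell 3.0+ only: read the last 20 lines of the file
Get-Content .\Data.xml -Tail 20
```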

Without this script the project would have been delayed. The first basic version of the script took about an hour to write and test, but its execution time was 22 hours. The optimized version took another hour to write and test, and the execution time dropped to 3 hours.

The script:
XMLSplit.ps1

Edit:
The files are also in GitHub:
https://github.com/MarkoMarjamaa/XMLSplit

Comments

  2. Your script worked WONDERFULLY for some of the XML files I need to split! However, I have a set of XML files where the "element" is unique (CaseRecord CaseID="[ID number]"). Is it possible to tweak your script to accommodate this?

    Replies
    1. I'm not giving you a complete solution (sorry), but it should not be too difficult. The actual loop only seeks end tags. The only place where the script searches for a start tag is around row 33:
      # process the line
      if($line.Contains("<$($element_name)>")){
      Write-Host "First element found"
      $pos = $line.IndexOf("<$($element_name)>")
      If you change that to:
      # process the line
      if($line.Contains("<$($element_name) ")){
      Write-Host "First element found"
      $pos = $line.IndexOf("<$($element_name) ")

      Then it should work. The only problem is that if your schema also contains other element names that start with "<CaseRecord ", then I'm not sure how it will behave.

      This is the easiest way. It's also possible to search for a "perfect" solution, maybe with regular expressions, but I don't think it's needed in this case.

      When the actual loop copies the rows, it will also automatically copy the CaseID number.
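      If you do want the regular-expression route, a sketch could look like this (assuming the element name contains no regex special characters; variable names follow the snippet above):

```powershell
# Match <CaseRecord> as well as <CaseRecord CaseID="...">,
# but not longer names like <CaseRecordType>
if ($line -match "<$($element_name)[ >]") {
    Write-Host "First element found"
    $pos = $line.IndexOf("<$($element_name)")
}
```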

  3. It looks like that works! For some reason I found the respective code around line 65. Not sure if my text editor displays the script differently than yours. Anyway, thanks for sharing this!!

  4. Hi Marko Marjamaa,

    I am unable to get the result. I am getting the error below. Please help me to run this script.

    Get-Content : Cannot find path 'C:\Users\csujay\Desktop\data\Split\Data.xml.PostFix.xml' because it does not exist.
    At C:\Users\csujay\Desktop\data\Split\XMLSplit.ps1:130 char:17
    + Get-Content "$($source_file).PostFix.xml" | %{
    + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo : ObjectNotFound: (C:\Users\csujay...xml.PostFix.xml:String) [Get-Content], ItemNotFoundException
    + FullyQualifiedErrorId : PathNotFound,Microsoft.PowerShell.Commands.GetContentCommand

  5. Sorry, but I don't have the time to debug your environment.
    Does the file exist?
    Get-Content is a very basic PowerShell function, so the problem is not in my code.

