Splitting huge XML files into smaller ones


Ever had a huge XML file you could not open with your normal text editor?

I had an XML file of around 1 gigabyte. I had to read it into a database, but I found out that our version of Oracle's XML functions had a limit of only 65536 subelements in an element. My file had about 30 times that many elements... I had to find a solution quickly. (The error is ORA-30936: Maximum number of XML node elements exceeded.)

The file was basically "rows" of individual XML objects, and these individual "rows" were not huge in size. It was the number of rows that was huge. So if I could split these rows into multiple files and read the files one by one, the problem would be solved.

I was not allowed to download any software to split my XML file and could not find any script on the Internet to do the job. So I had to write my own script.

The parameters you need are the name of the file, how many "rows" you want in one split file, and the name of the XML tag that marks the "row" element.

Example file Data.xml:
<root>
       <text>This is pre tag text</text>
<root2>
 <element>Row1</element>
 <element>Row2</element>
 <element>Row3</element>
 <element>Row4</element>
 <element>Row5</element>
 <element>Row6</element>
</root2>
        <text>This is post tag text</text>

</root>

When you start the script, it first searches for the "posttag", that is, the end of the file after the last </element> tag. It writes this "posttag" to a *.PostTag.xml file.

Then it searches for the "pretag", that is, the start of the file before the first <element> tag. It saves this to a *.PreTag.xml file.

Then it starts reading the "rows", which in this example are marked with the <element> tag. It creates the first split file, appends the PreTag file to it, then the number of "rows" you wanted in one file, and finally appends the PostTag file. It repeats this loop until all rows are processed.
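The loop described above can be sketched in a few lines of PowerShell. This is only a simplified illustration, not the actual XMLSplit.ps1: the variable names are made up, it assumes each "row" ends on its own line and that the PreTag and PostTag files have already been written, and it ignores the case where the last file gets fewer rows than requested.

```powershell
# Simplified sketch of the splitting loop (illustrative, not the real XMLSplit.ps1)
$source   = "Data.xml"
$per_file = 1
$row_tag  = "element"

$file_no = 0
$rows    = 0
$out     = $null

Get-Content $source | ForEach-Object {
    # Only lines that close a "row" are copied into the split files
    if ($_.Contains("</$row_tag>")) {
        if ($rows -eq 0) {
            # Start a new split file with the pretag part
            $out = "{0}.{1:D4}.xml" -f $source, $file_no
            Get-Content "$source.PreTag.xml" | Set-Content $out
        }
        Add-Content $out $_
        $rows++
        if ($rows -eq $per_file) {
            # Finish the split file with the posttag part
            Get-Content "$source.PostTag.xml" | Add-Content $out
            $file_no++
            $rows = 0
        }
    }
}
```

Streaming the file line by line like this is also what made the first version slow; the real script had to be optimized beyond this.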

I wrote the script in PowerShell, and the command parameters are:
powershell -file XMLSplit.ps1 FileToBeSplit.xml NumberOfRowsInOneFile RowTag

For the example file the command might be:

powershell -file XMLSplit.ps1 Data.xml 1 element

The first split file Data.xml.0000.xml would look like this:
<root>
       <text>This is pre tag text</text>
<root2>
 <element>Row1</element>
</root2>        <text>This is post tag text</text>
</root>

Creating the posttag file takes almost half of the total execution time. I don't use any "tail" command to get the last lines but have to use what is available in PowerShell 2.0. If you have PowerShell 3.0, you can try the new Tail parameter of Get-Content. Reading a 2 GB file takes about 3 hours, and in my case the execution time wasn't critical any more.
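With PowerShell 3.0, for example, the last lines of the example file can be read directly without streaming through the whole file:

```powershell
# PowerShell 3.0+ only: read the last 20 lines of the file
Get-Content .\Data.xml -Tail 20
```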

Without this script the project would have been delayed. The first basic version of the script took about an hour to write and test, but its execution time was 22 hours. The optimized version took another hour to write and test, and the execution time dropped to 3 hours.

The script:
XMLSplit.ps1

Edit:
The files are also in GitHub:
https://github.com/MarkoMarjamaa/XMLSplit

Comments

  2. Your script worked WONDERFULLY for some of the XML files I need to split! However, I have a set of XML files where the "element" is unique (CaseRecord CaseID="[ID number]"). Is it possible to tweak your script to accommodate this?

    Replies
    1. I'm not giving you a complete solution (sorry), but it should not be too difficult. The actual loop only seeks end tags. The only place where the script searches for a start tag is around row 33:
      # process the line
      if($line.Contains("<$($element_name)>")){
      Write-Host "First element found"
      $pos = $line.IndexOf("<$($element_name)>")
      If you change that to:
      # process the line
      if($line.Contains("<$($element_name) ")){
      Write-Host "First element found"
      $pos = $line.IndexOf("<$($element_name) ")

      Then it should work. The only problem is that if your schema also contains other element names that start with "<CaseRecord ", then I'm not sure how it will behave.

      This is the easiest way. It's also possible to search for a "perfect" solution, maybe with regular expressions, but I don't think it's needed in this case.

      When the actual loop copies the rows, it will also automatically copy the CaseID number.
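      If you do want the regular-expression route, a sketch could look like this (assuming the element name contains no regex special characters; variable names follow the snippet above):

```powershell
# Match <CaseRecord> as well as <CaseRecord CaseID="...">,
# but not longer names like <CaseRecordType>
if ($line -match "<$($element_name)[ >]") {
    Write-Host "First element found"
    $pos = $line.IndexOf("<$($element_name)")
}
```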

  3. It looks like that works! For some reason I found the respective code around line 65. Not sure if my text editor displays the script differently than yours. Anyway, thanks for sharing this!!

  4. Hi Marko Marjamaa,

    I am unable to get the result. I am getting the error below. Please help me to run this script.

    Get-Content : Cannot find path 'C:\Users\csujay\Desktop\data\Split\Data.xml.PostFix.xml' because it does not exist.
    At C:\Users\csujay\Desktop\data\Split\XMLSplit.ps1:130 char:17
    + Get-Content "$($source_file).PostFix.xml" | %{
    + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo : ObjectNotFound: (C:\Users\csujay...xml.PostFix.xml:String) [Get-Content], ItemNotFoundException
    + FullyQualifiedErrorId : PathNotFound,Microsoft.PowerShell.Commands.GetContentCommand

  5. Sorry, but I don't have the time to debug your environment.
    Does the file exist?
    Get-Content is a very basic PowerShell function, so the problem is not in my code.

