Reading (too?) large xml-file

2

I need to use some data from a xml-file which size is approx > 2 GB (you can have a look: https://leidata.gleif.org/api/v1/concatenated-files/lei2/20180128/zip)

I need the data in access and tried to read the file with the following vba-code:

Public Function ReadLei(strFile As String) As Long
Dim xmlLeiData As New MSXML2.DOMDocument
With xmlLeiData
    .async = False
    .preserveWhiteSpace = False
    .validateOnParse = False
    .resolveExternals = False
End With
If xmlLeiData.Load(strFile) = True Then
    MsgBox "ok"
Else
    MsgBox xmlLeiData.parseError
End If
ReadLei = 0
End Function

its end up with an 0x8007000E out of memory error.

Are there other ways to read and parse such large XML-files via vba/Access?

xml
ms-access
vba
ms-access-2013
asked on Stack Overflow Jan 29, 2018 by DaDirnbocher • edited Jul 13, 2020 by Erik A

3 Answers

1

I don't know specifically what might work under VBA / Access, but building a DOM for this size input probably isn't viable.

The MSXML parser (which you are using) also has a SAX api, where the parser reads the input file and notifies the application of events like start tags, end tags, attributes, and text nodes. This may meet your needs, but the programming can be tricky.

Microsoft's .NET parser (System.Xml) also has a "pull" API, allowing the application to call a "nextEvent()" method provided by the parser, so you can read your way through the file in a structured way. Many people find this a bit easier to use than the SAX approach, though it's still very low-level coding.

A radically different approach would be to use streaming XSLT 3.0 (perhaps in a transformation to reduce the file to a manageable size, which you could then access using DOM in your accustomed way). For that you'll need the commercial version of Saxon. It will cost a bit more but save you time.

Update: you say in a comment that the file contains 1m records, and you only want to retain 4 or 5 columns. You can cut down the file like this in a streaming XSLT 3.0 transformation, where P, Q, R, and S are the wanted columns:

<xsl:transform version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:mode streamable="yes" on-no-match="deep-skip"/>

<xsl:template match="/*">
  <xsl:copy>
    <xsl:apply-templates select="*"/>
  </xsl:copy>
</xsl:template>

<xsl:template match="P|Q|R|S">
  <xsl:copy-of select="."/>
</xsl:template>

</xsl:transform>
answered on Stack Overflow Jan 29, 2018 by Michael Kay • edited Jan 29, 2018 by Michael Kay
1

Since you're not providing specifics, I can't give you specifics.

You can first transform the XML to only contain the data you want using an xslt file and the Application.TransformXML method (documentation), and then import the XML using the Application.ImportXML method (documentation).

Note that the maximum size of an Access database is 2GB. Importing large files can quickly exceed this limit.

answered on Stack Overflow Jan 29, 2018 by Erik A
0

Thanks for the (huge) sample-file. I have been processing xml files for more than 15 years now. I allways had doubts about how Ms Access would perform when nearing the GB-limits.

In my experience, and now confirmed, there is only one winner : Open FileURL For Input As #FileNum. To be combined with InputLine = Input(1000, #FileNum) ' read some 1.000 characters. Basically, just treat the XML as a plain text-file.

The coding would be easier if Line Input could be used in stead of Input, but in your sample this is not the case. Your sample file uses vbLf to mark the end of a line in the text, and Line Input needs vbCrLf to work properly.

I ended up with a small application that first scans a file for the different occuring tags. Afterwards, these tags can be assigned to several tasks as :

  • assign the value to a field in a table
  • just skip the value
  • close child-table

In a second full read all the values are assigned to their destination fields in the database.

I'm gonna try to clarify a bit by inserting some code (as of 02 Feb 2018 15h London time, I have to dash, I am gonna come back to it at a later point of time)


Option Compare Database
Option Explicit

Dim marrKnownTags() As String

Public Sub ReadFile2GB()
Dim FileNum As Integer
Dim InputLine As String

    Call init_marrKnownTags

    FileNum = FreeFile
    Open "X:\20180128-gleif-concatenated-file-lei2.xml" For Input As #FileNum
    Do While Not EOF(FileNum)
        InputLine = Input(99000, #FileNum)   ' read some 99.000 characters
        Call processTemporaryBlock(InputLine)
        ...
    Loop
    Close #FileNum
End Sub

Public Function positionCrOfLf(PieceToScan As String) As Long
Dim Pos As Long

    Pos = 0

    If Pos = 0 Then
        Pos = InStr(PieceToScan, vbCrLf)
    End If
    If Pos = 0 Then
        Pos = InStr(PieceToScan, vbLf)
    End If
    If Pos = 0 Then
        Pos = InStr(PieceToScan, vbCr)
    End If

    'Debug.Print "fie positionCrOfLf := " & Pos
    positionCrOfLf = Pos

End Function


Private Sub init_marrKnownTags()
    ReDim Preserve marrKnownTags(333)
    marrKnownTags(1) = "<?xml version="     ' start of xml
    marrKnownTags(10) = "<lei:LEIData"      ' Table_01 Open
    marrKnownTags(20) = "<lei:LEIHeader>"   ' Table_02 Open
    marrKnownTags(21) = "<lei:ContentDate>" ' field
    marrKnownTags(22) = "<lei:FileContent>" ' field
    marrKnownTags(23) = "<lei:RecordCount>" ' field

    marrKnownTags(30) = "<lei:Extension>"       ' Table_03 Open

    marrKnownTags(40) = "<gleif:Sources>"       ' Table_04 Open
    marrKnownTags(41) = "<gleif:Source>"        ' addnew record Table_04
    marrKnownTags(42) = "<gleif:ContentDate>"   ' field
    marrKnownTags(43) = "<gleif:Originator>"    ' field
    marrKnownTags(44) = "<gleif:RecordCount>"   ' field
    marrKnownTags(45) = "</gleif:Source>"       ' save this new record Table_04
    marrKnownTags(46) = "</gleif:Sources>"      ' Table_04 Close
    marrKnownTags(31) = "</lei:Extension>"      ' Table_03 Close
    ' ... some more child-tables in the future ??
    marrKnownTags(129) = "</lei:Entity>"         ' Table_12 Close ' close child table

    marrKnownTags(140) = "<lei:Registration>"        ' Table_14 Open
    marrKnownTags(141) = "<lei:LastUpdateDate>"      ' DO NOT SKIP field with "2017-11-30T15:06:27Z" =?= 2017-11-30 15:06:27
    marrKnownTags(142) = "<lei:RegistrationStatus>"  ' DO NOT SKIP field with "ISSUED"
    marrKnownTags(149) = "</lei:Registration>"       ' Table_14 Close

    marrKnownTags(2) = "</lei:LEIRecord>"    ' save this new record

    marrKnownTags(2) = "</lei:LEIRecords>"   ' Table_11 Close ' close child table

End Sub

Public Function processTemporaryBlock(ByVal TemporaryBlock As String)
Dim positionStart As Long, positionEnd As Long, positionLength As Long
Dim OneLine As String, searchTag As String
Dim indexArray As Long
Dim tagFoundYN As Boolean
    positionStart = 1
    positionEnd = positionCrOfLf(TemporaryBlock)
    Do While positionEnd > 0
        OneLine = trim(Mid(TemporaryBlock, positionStart, positionEnd - 1))
        Debug.Print "OneLine := " & OneLine
        tagFoundYN = False
        For indexArray = LBound(marrKnownTags) To UBound(marrKnownTags)
            searchTag = marrKnownTags(indexArray)
            searchTag = Trim(searchTag)
            If searchTag = "" Then
                ' skip
            Else
                If Left(OneLine, Len(searchTag)) = searchTag Then
                '    Call processTag(OneLine)
                    tagFoundYN = True
                    exit for
                End If
            End If
        Next
        positionStart = positionStart + positionEnd
        positionEnd = positionCrOfLf(Mid(TemporaryBlock, positionStart))
    Loop
End Sub
answered on Stack Overflow Feb 2, 2018 by JonRo • edited Feb 2, 2018 by JonRo

User contributions licensed under CC BY-SA 3.0