Importing a large data file with sorts and joins in SSIS

2

I have a large file (15GB) that I'm importing with SSIS, which usually isn't a problem, but I'm also using sorts and merge joins before inserting the data into a table. The problem arises when the file is being buffered in the data flow, where I get this error:

Error: The system reports 73 percent memory load. There are 3478020096 bytes of physical memory with 911761408 bytes free. There are 2147352576 bytes of virtual memory with 288587776 bytes free. The paging file has 6954242048 bytes with 3025256448 bytes free.

[Flights File [999]] Error: The attempt to add a row to the Data Flow task buffer failed >with error code 0x8007000E.

Is there a way to split the file into 3 or more smaller files or perhaps another more efficient method to import the data?

Screenshot of the data flow.

ssis
asked on Stack Overflow Jan 24, 2012 by Davenport • edited Mar 24, 2020 by Amira Bedhiafi

1 Answer

3

I'd probably look at restructuring your package like this. I've converted all those Merge Join Transformations to Lookups

enter image description here

A Lookup Transformation most likely accomplishes what you're doing with those merge joins (given key 10, you're retrieving reference data from table B). You don't specify 2005 or 2008/2008R2 so the screens change between versions. The most important thing to be aware of is the behaviour when a match is not found. When it doesn't find a match, by default it will error out. In 2005... I don't even remember 2005 for certain, but I believe it did not have an option for "Redirect rows to no match output." In 2008+, you have 4 options on how to deal with no match found

  • Ignore failure - no harm, no foul rows continue down the pipeline, values nulled out
  • Redirect rows to error output - unmatched data goes down the error output. Doesn't cause the data flow to fail.
  • Fail component - fails the component
  • Redirect rows to no match output - same result as error output except green line connector is used instead of red.

Second most important thing is to be aware of is caching. By default, the cache mode is set to full cache. This means that when the package starts, SSIS will execute the query and create a local cache of all that data. Everywhere in SSIS, you want to only pull back the columns you need to complete your task but lookup most especially. Had a coworker that kept running a server out of memory because they were trying to stream 50GB of data (entire table) into a cache when they only needed 2 columns. For all but your Stations and Metros, full cache is probably fine. If the box has some serious memory, full cache may be fine for that one too. Otherwise, your options are none or partial cache. None will perform singleton queries to the source system for every row that hits the transformation. Useful if the source system is rapidly changing. Otherwise, switch the cache option to Partial, and then figure out how much memory to allocate to it. Partial attempts to be best of both. It will remember it searched for value 10 and keep that in a cache until it the most recently used algorithm determines it's expired.

Final thing about lookups and since you've been using merge join you've already hit it, is that it is case sensitive.

Configuring lookups is easy. Here I specified the lookup should ignore failures. You will know what the appropriate behaviour is for your data.

enter image description here

Do write a query in here, don't take the easy way out of selecting a table, unless you really need every column and every row in that source. And even then, write a SELECT * FROM myTable as it will perform better (selecting a table results in an openrowset query, no need for extra layer)

enter image description here

Column data types must match, again you had already seen this with your merge join but for future readers. If you can cast the lookup data to the correct type versus carrying along a second column in the dataflow that matches the lookup's type, you will get better throughput.

enter image description here

Reference material on Lookups

Merge join route

If you stick with the merge joins, you will again want to access your data via a SQL Command and only pull back the columns you need and when appropriate, cast to the correct data type in your query. Add an explicit ORDER BY MyKeyColumn. Only if you've done this, can you then right click on your OLE DB Source, select Advanced Editor, Input and Ouput Properties.

On OLE DB Source Output, switch IsSorted from False to True.

Expand OLE DB Source Output and then expand Output Columns Find MyKeyColumn and change SortKeyPosition from 0 to 1 (increase by 1 for each sort order provided in original query)

answered on Stack Overflow Jan 24, 2012 by billinkc

User contributions licensed under CC BY-SA 3.0