How to parallelize a Data Factory custom activity?


I have 600 files in GRIB (binary) format which I need to convert to CSV. Initially this is a one-time conversion, but later on we will receive daily files, so I have implemented a custom C# activity to run in Data Factory. When I first ran this activity using a Batch account with a D14 v2 VM, it converted one file at a time and each file took 20-25 minutes, which adds up to more than a week for the full set. That is far too long if we ever need to re-run the conversion. Is there a good way to parallelize this conversion?

Both input and output files are stored in Blob Storage, in two separate containers. The input files are roughly 50 MB each and the output files are almost 2 GB each. The Data Factory activity converts every file found in the input container when the job starts.
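A simplified sketch of the per-file conversion path (container names and the ConvertGribToCsv helper are illustrative placeholders, not the real code):

```csharp
// Sketch only: container names and ConvertGribToCsv are placeholders,
// using the classic WindowsAzure.Storage client.
using System.IO;
using System.Linq;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

public static class GribConversion
{
    public static void ConvertAll(string storageConnectionString)
    {
        CloudBlobClient client = CloudStorageAccount.Parse(storageConnectionString).CreateCloudBlobClient();
        CloudBlobContainer input = client.GetContainerReference("grib-input");   // assumed container name
        CloudBlobContainer output = client.GetContainerReference("csv-output");  // assumed container name

        foreach (CloudBlockBlob gribBlob in input.ListBlobs(null, true).OfType<CloudBlockBlob>())
        {
            CloudBlockBlob csvBlob = output.GetBlockBlobReference(Path.ChangeExtension(gribBlob.Name, ".csv"));

            // Stream both ends so the ~2 GB CSV output never has to sit in memory.
            using (Stream source = gribBlob.OpenRead())
            using (Stream target = csvBlob.OpenWrite())
            {
                ConvertGribToCsv(source, target); // placeholder for the real conversion routine
            }
        }
    }

    static void ConvertGribToCsv(Stream gribInput, Stream csvOutput) { /* actual GRIB parsing omitted */ }
}
```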

I tried adding a thread pool in the custom activity, with a separate thread per input file, to parallelize the work. That works well for around 10 files (about 40 minutes in total), but with more input files the Data Factory job ends in a strange error with no exception info in the system.log file:

Error in Activity: Process exited with code: '-532462766'. Exception message: 'Exception from HRESULT: 0xE0434352'.
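For what it's worth, -532462766 is just the signed form of 0xE0434352, the generic code for an unhandled .NET exception escaping the process, so something is throwing inside one of the worker threads (with ~2 GB outputs, memory pressure is my suspicion). A capped-parallelism variant of that attempt would look roughly like the sketch below (ConvertOneFile stands in for the real per-file conversion):

```csharp
// Sketch of capping the concurrency instead of starting one thread per file;
// ConvertOneFile is a placeholder for the per-blob conversion above.
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage.Blob;

public static class ParallelGribConversion
{
    public static void ConvertAll(IEnumerable<CloudBlockBlob> gribBlobs)
    {
        var options = new ParallelOptions
        {
            // A D14 v2 has 16 cores; far more concurrent conversions than that
            // mostly adds memory pressure rather than throughput.
            MaxDegreeOfParallelism = Environment.ProcessorCount
        };

        Parallel.ForEach(gribBlobs, options, blob =>
        {
            try
            {
                ConvertOneFile(blob); // placeholder for the real per-file conversion
            }
            catch (Exception ex)
            {
                // Log the real failure instead of letting it escape and kill
                // the whole process with 0xE0434352.
                Console.WriteLine("Failed to convert " + blob.Name + ": " + ex);
            }
        });
    }

    static void ConvertOneFile(CloudBlockBlob gribBlob) { /* streamed conversion as above */ }
}
```

Even with the concurrency capped, though, a single VM may not give enough headroom for 600 files, which is why I am asking whether there is a better way to fan this work out.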

c#
azure-data-factory

1 Answer


I would suggest you use Azure Data Lake Analytics to achieve the scalability you need. Assuming the files have some sort of structure, writing some U-SQL to shred, combine and convert them would work well and would be much easier to implement than an ADF custom activity running on Batch service compute.

You would of course need to have the files in Data Lake Store rather than Blob Storage to get the best performance, but if you have to, Data Lake Analytics can read Blob Storage directly as a data source via a wasb:// URL.


Lastly, you could call the U-SQL from Data Factory and specify the degree of parallelism on the activity. You could also create a U-SQL stored procedure that you pass collections of file names to, which again can be driven from Data Factory.

Check out this post: https://www.purplefrogsystems.com/paul/2017/02/passing-parameters-to-u-sql-from-azure-data-factory/

Hope this helps.

PS: in response to the comment above, don't copy and paste JSON! Yuk! Use PowerShell with some metadata to generate the Data Factory pipelines using the ADF cmdlets.

answered on Stack Overflow May 18, 2017 by Paul Andrew • edited May 18, 2017 by Paul Andrew

User contributions licensed under CC BY-SA 3.0