Creating a Custom .NET Activity Pipeline for Azure Data Factory


Feodor Georgiev, 11 August 2017
Azure Data Factory provides a radical new cloud-based way of collecting and preparing data for its storage and analysis. How do you get started with it and explore the possibilities it provides? Feodor Georgiev shows the practicalities of preparing a pipeline for use, from setting up the Azure environment to downloading a file from an FTP server to blob storage in the Azure environment.

The series so far:

1. Creating a Custom .NET Activity Pipeline for Azure Data Factory
2. Using the Copy Wizard for the Azure Data Factory
3. The Quick and the Dead Slow: Importing CSV Files into Azure Data Warehouse

Azure Data Factory is the integration tool in Azure that builds on the idea of cloud-based ETL, but uses the model of Extract-and-Load (EL) and then Transform-and-Load (TL). To do this, it uses data-driven workflows called pipelines. These can collect data from a range of data stores and process or transform it. This data processing can use the available Azure-based compute services such as
Hadoop, Spark, and Azure Machine Learning. It can then publish data to a variety of downstream data stores. In this article, I will show how to create a custom .NET activity pipeline for Azure Data Factory.

I’ll start by describing a way to create a custom .NET activity that downloads files from an FTP server to blob storage in Azure. For this we will use Visual Studio 2015 to create a Class Library project for our custom code, and then add a DataFactoryApp to the solution so that we can build the ADF (Azure Data Factory) pipeline from Visual Studio and deploy it directly once the .NET class is ready. We will be using the Batch Service and a compute environment which runs the custom code on a set of Azure VMs that are brought up on demand when the pipeline is scheduled.

For this article, I will be working with an FTP server provided by one of the Scandinavian mobile operators. The FTP server has anonymous access, and its original purpose is to test the speed of internet connectivity, but this is more than enough for our purpose of building a custom .NET activity.

Note: Azure Data Factory currently supports an FTP data source, and we can use the Azure portal and the ADF Wizard to do all of these steps, as I will cover in a future article.

The point of this article, however, is to introduce the reader to the flexibility of custom .NET pipelines and the possibilities they present for automating ADF deployments from Visual Studio without introducing unnecessary complexity to the code. Note also that by creating a custom .NET activity and deploying ADF from Visual Studio, we can, if we want, create a common repository which can be used to automate and speed up pipeline deployments in larger teams.

For this article, I have assumed that the reader already has an Azure subscription and is familiar with concepts such as .NET, DLL files, FTP requests and Azure Data Factory.

About the source files

As mentioned above, the source files are stored on the server of one of the telecom companies: ftp://speedtest.tele2.net/. There is a range of files of different sizes in a folder, looking like this:

So, in our custom activity, we will have the class download the 20MB.zip file and store it in our blob storage.

What we need:

Here is a high-level overview of what we need in order to set up the custom .NET activity:

1. A Batch Account and a Batch Pool. Azure Batch is a platform service which is used to run an application in the cloud; the work runs on a managed collection of virtual machines. For an introduction to Azure Batch, visit the documentation: https://docs.microsoft.com/en-us/azure/batch/batch-technical-overview. In this case, we will create a batch pool with a very simple Windows Server VM to carry out the FTP download task. In a real use case, the Batch service is very flexible and can scale the pool of virtual machines automatically.

2. A blob storage account where the files will be stored.

3. A Visual Studio project which will contain the custom code. This code will be compiled into a DLL, zipped and then stored in the blob storage, from where it will be accessed by the ADF pipeline.

4. A Data Factory App, so we can deploy the ADF directly from our solution (so we don’t have to manually zip and copy the DLLs to the blob storage every time we make changes to the C# code).

5. A Data Factory, from which we can publish the pipeline.

6. Debugging and logging of the pipeline activities. We will create a console application to help us debug the custom .NET activity locally.

So, let’s get started…

Create a Batch Account and a Batch Pool

The Batch Account is a free service, whereas the underlying resources (pools, VMs and storage usage) are paid for per the Azure pricing.

To create a Batch Account, simply use the Azure portal and click ‘New’, ‘Compute’, ‘Batch Service’. The following tab will open:


Enter the information in the required fields and create the account. Make sure that you create a linked storage account for the Batch service, since it will be used later during the deployment of our ADF pipeline from Visual Studio.

Now we can create the Batch Pool by going to the Batch Service we just created and clicking on the Pools blade. In the Pools blade, click on ‘Add’ and fill in the required fields. In this case I am selecting a Windows 2008 Server VM with the smallest available resources. In a real production environment, these settings should be carefully considered, since the choice of VM size and pool scale will affect both the performance and the cost of the pipeline.

Note the Pool ID after creating it, since it will be used later in the ADF pipeline.

Create a Blob storage

For this article, we will be using a folder called ftpdata in a blob storage account that we create in our Azure subscription. For details on how to create a blob storage account, visit the Azure documentation: https://docs.microsoft.com/en-us/azure/storage/storage-create-storage-account

Create the custom activity code

First, we need to open Visual Studio (in my case I am using VS2015) and create a new Class Library project called FTPDownloader. It is important to note that the target framework should be .NET 4.5.2.

Secondly, we need to install the NuGet dependencies by entering the following lines in the NuGet Package Manager Console (to open the console, click on ‘Tools’, ‘NuGet Package Manager’, ‘Package Manager Console’):

Install-Package Microsoft.Azure.Management.DataFactories
Install-Package Azure.Storage

As you run the above commands, you will see the references being added to the project.

After this is done, we are ready to write the custom code.

using Microsoft.Azure.Management.DataFactories.Models;
using Microsoft.Azure.Management.DataFactories.Runtime;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net;

namespace FTPDownloader
{
    public class FTPActivity : IDotNetActivity
    {
        string ftpUserName = "";
        string ftpPassword = "";

        public IDictionary<string, string> Execute(IEnumerable<LinkedService> linkedServices,
            IEnumerable<Dataset> datasets, Activity activity, IActivityLogger logger)
        {
            // Get FTP URL from extended properties
            var currentActivity = (DotNetActivity)activity.TypeProperties;
            var extendedProperties = currentActivity.ExtendedProperties;
            var ftpFileUrl = extendedProperties["FTPFileUrl"];

            // Get blob connection string from Output dataset
            var outputDataset = datasets.Single(dataset => dataset.Name == activity.Outputs.Single().Name);
            var outputLinkedService = linkedServices.Last(linkedService =>
                    linkedService.Name == outputDataset.Properties.LinkedServiceName)
                .Properties.TypeProperties as AzureStorageLinkedService;
            var blobConnectionString = outputLinkedService.ConnectionString;

            // Get blob folder from Output dataset
            var outputLocation = outputDataset.Properties.TypeProperties as AzureBlobDataset;
            var folderPath = outputLocation.FolderPath;

            // Upload file to blob
            UploadFileToBlob(ftpFileUrl, ftpUserName, ftpPassword, blobConnectionString, folderPath, logger);

            return new Dictionary<string, string>();
        }

        public void UploadFileToBlob(string ftpFileUrl, string ftpUserName, string ftpPassword,
            string blobConnectionString, string blobFolderPath, IActivityLogger logger)
        {
            // Prepare the FTP download request
            var request = (FtpWebRequest)WebRequest.Create(new Uri(ftpFileUrl));
            request.Credentials = new System.Net.NetworkCredential(ftpUserName, ftpPassword);
            request.UseBinary = true;
            request.UsePassive = true;
            request.KeepAlive = true;
            request.Method = WebRequestMethods.Ftp.DownloadFile;

            var outputValue = string.Empty;

            using (var ftpResponse = (FtpWebResponse)request.GetResponse())
            {
                using (var ftpStream = ftpResponse.GetResponseStream())
                {
                    logger.Write("connecting to the blob..");
                    var outputStorageAccount = CloudStorageAccount.Parse(blobConnectionString);
                    var fileName = Path.GetFileName(ftpFileUrl);
                    var outputBlobUri = new Uri(outputStorageAccount.BlobEndpoint,
                        blobFolderPath + "/" + fileName);
                    var outputBlob = new CloudBlockBlob(outputBlobUri, outputStorageAccount.Credentials);
                    logger.Write("uploading to the blob URI: {0}", outputBlobUri.ToString());
                    // Stream the FTP response directly to the blob
                    outputBlob.UploadFromStream(ftpStream);
                    logger.Write("upload succeeded");
                }
            }
            request = null;
        }
    }
}

It is important to notice that the class implements the Execute method of the IDotNetActivity interface, which expects four parameters: the linked services, the datasets, the activity (which carries the extended properties) and a logger. It returns a dictionary.


The above code simply shows how to implement the interface, how to get extended properties from the pipeline JSON, how to get the connection string from the output dataset, and how to make a simple call to an FTP server and stream a file directly to our blob storage.
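For reference, the FTPFileUrl value that the C# code reads from ExtendedProperties is defined in the pipeline JSON, which we will compose in full later in this article; the relevant excerpt of the activity's typeProperties looks like this:

"extendedProperties": {
    "FTPFileUrl": "ftp://speedtest.tele2.net/20MB.zip"
}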

Debugging and logging

If you are a developer, you probably know that code rarely works perfectly on the first execution. This means that very often we need to debug our code, and any information available in the process will speed up our path to a successful execution.

The challenge in this case is significant: how do we debug the custom .NET activity if it can be executed only in Azure, triggered by ADF and running in the context of the Batch environment?

There are a couple of ways to go about it. One way is to write custom log messages during each step of the C# execution to a custom log file.

As you have noticed, the IActivityLogger is used in the code to write custom messages from the custom .NET execution to the logs for debugging later. These custom messages can be viewed in the ADF ‘Monitor and Manage’ page, by clicking on the pipeline, then clicking on its execution slice and finally looking at the ‘Log files’ section. There are two files: one is called system-0.log and the other user-0.log. The former is the default log generated by the system and the latter is the one we write to from the C# code with the logger.Write() method.


Another way to do local debugging is to create a console application in our solution and use it as if it were the ADF pipeline, supplying the variables to our FTPDownloader class.

To do this, just create another console application in the solution called ADFCustomActivityTest, like this:

Again, we need to install the NuGet dependencies by entering the following lines in the NuGet Package Manager Console (to open the console, click on ‘Tools’, ‘NuGet Package Manager’, ‘Package Manager Console’):

Install-Package Microsoft.Azure.Management.DataFactories
Install-Package Azure.Storage

Make sure to add a reference to the FTPDownloader project by right-clicking the References node of the ADFCustomActivityTest project.


After this is done, the code for Program.cs is as follows:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Net;
using FTPDownloader;
using Microsoft.Azure.Management.DataFactories.Models;
using Microsoft.Azure.Management.DataFactories.Runtime;

namespace ADFCustomActivityTest
{
    class Program
    {
        static void Main(string[] args)
        {
            // Supply the same values the ADF pipeline would pass at runtime
            var ftpActivity = new FTPActivity();
            string ftpFileUrl = "ftp://speedtest.tele2.net/20MB.zip";
            string ftpUserName = string.Empty;
            string ftpPassword = string.Empty;
            string blobConnectionString = "[Your Storage Connectionstring]";
            string blobFolderPath = "ftpdata";

            ftpActivity.UploadFileToBlob(ftpFileUrl, ftpUserName, ftpPassword, blobConnectionString,
                blobFolderPath, new ConsoleActivityLogger());

            Console.WriteLine("Press Any Key to Exit...");
            Console.ReadKey();
        }
    }

    // Writes the activity log messages to the console instead of the ADF log files
    public class ConsoleActivityLogger : IActivityLogger
    {
        public void Write(string format, params object[] args)
        {
            Console.WriteLine(format, args);
        }
    }
}

Make sure to paste the correct connection string for the blob storage in the code above.

The next step would be to set the ADFCustomActivityTest console app as the startup project, put breakpoints in Program.cs and run the console app.

After you run the app, the breakpoint will be hit and the local variables will be displayed:

When you click ‘Continue’, the console app will print out a message that the file was uploaded to the blob.

Yet another way to go about it is to use Gerhard Brueckl’s debugger, which allows us to run our custom .NET activity locally in debug mode. His work is on GitHub and he describes the process here. This is not in the scope of this article; however, I can recommend this way of debugging custom activities.

Now that we have tested our code, we can add a Data Factory App and deploy our code to Azure Data Factory. Make sure to set the FTPDownloader as the startup project.

About adding a Data Factory App to the project

First we need to install the Data Factory tools for Visual Studio by going to ‘Tools’, ‘Extensions and Updates’, then ‘Online’ / ‘Visual Studio Gallery’, and looking for Microsoft Azure DataFactory Tools for Visual Studio.

After we have installed the Data Factory Tools, we should add the Data Factory App to the project by clicking on ‘File’, ‘Add’, ‘New Project…’ in Visual Studio. From the dialog, we select ‘DataFactory’, ‘Empty Data Factory Project’ and we give it the name DataFactoryApp.

When this is done, we see that our Solution Explorer looks like this:

The next steps will be to create:

1. a linked service for the batch activity and one for the blob storage
2. an output table (we need this for the output results from the pipeline)
3. the pipeline itself, with all extended properties

Creating the linked services

To create the linked service, we need to right-click on the LinkedServices folder in the Solution Explorer, select the Azure Batch Linked Service type and give it a name, such as AzureBatchLinkedService_FTPData.json:


After we click ‘Add’, the JSON code for the service will be presented to us, and we need to fill in the proper values for the Batch Service that we want to use.
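As a rough guide, a filled-in Azure Batch linked service looks like the following minimal sketch; the account name, access key, pool name and region are placeholders that you need to replace with the values from your own Batch account, and the linkedServiceName refers to the storage linked service we create next:

{
    "name": "AzureBatchLinkedService_FTPData",
    "properties": {
        "type": "AzureBatch",
        "typeProperties": {
            "accountName": "<your batch account name>",
            "accessKey": "<your batch account key>",
            "poolName": "<your pool ID>",
            "batchUri": "https://<region>.batch.azure.com",
            "linkedServiceName": "AzureStorageLinkedService_FTPData"
        }
    }
}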

In a similar manner, we create a Storage linked service which points to the blob storage that we will be using.

In the next step, we need to edit the JSON to use the correct connection string for our blob storage.
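Again as a hedged sketch, the storage linked service is a small JSON file; the storage account name and key below are placeholders for your own values:

{
    "name": "AzureStorageLinkedService_FTPData",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=<account key>"
        }
    }
}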

Creating the output table


We need to right-click on the Tables folder and
create and output with the Azure Blob type and give
it a proper name:


In the next screen, we need to edit the JSON to point to the blob linked service we created earlier and to the correct folder in our blob container.
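A minimal sketch of such an output table, assuming the ftpdata folder and the storage linked service created above (the dataset name matches the output referenced by the pipeline below), might look like this:

{
    "name": "FTPBlobOutput",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": "AzureStorageLinkedService_FTPData",
        "typeProperties": {
            "folderPath": "ftpdata"
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        }
    }
}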

Composing the Pipeline

As we saw above in our custom .NET class, we have references to extended properties, such as FTPFileUrl.

These extended properties are supplied to the .NET class at runtime from the ADF pipeline, and in our case the code for creating the pipeline is as follows (we need to right-click on the Pipeline folder in Visual Studio and add a Custom Pipeline):

{
    "$schema": "http://datafactories.schema.management.azure.com/schemas/2015-09-01/Microsoft.DataFactory.Pipeline.json",
    "name": "FTPDownloaderPipeline",
    "properties": {
        "description": "Download FTP Data and Upload to Blob",
        "activities": [
            {
                "name": "DotNetActivityTemplate",
                "type": "DotNetActivity",
                "outputs": [
                    {
                        "name": "FTPBlobOutput"
                    }
                ],
                "typeProperties": {
                    "assemblyName": "FTPDownloader",
                    "entryPoint": "FTPDownloader.FTPActivity",
                    "packageLinkedService": "AzureStorageLinkedService_FTPData",
                    "packageFile": "customactivitycontainer/FTPDownloader.zip",
                    "extendedProperties": {
                        "FTPFileUrl": "ftp://speedtest.tele2.net/20MB.zip"
                    }
                },
                "linkedServiceName": "AzureBatchLinkedService_FTPData",
                "policy": {
                    "concurrency": 1,
                    "executionPriorityOrder": "OldestFirst",
                    "retry": 3,
                    "timeout": "00:05:00"
                },
                "scheduler": {
                    "frequency": "Day",
                    "interval": 1
                }
            }
        ],
        "start": "2017-07-01T00:00:00Z",
        "end": "2017-07-01T00:00:00Z"
    }
}

In the above script, we need to enter the proper values for the storage account and folder. Also, make sure to add a reference to the FTPDownloader project in the DataFactoryApp, otherwise we will get an error that the reference is missing.

Keep in mind that the pipeline can be extended with the partitionedBy functionality to dynamically assign DateTime and Text variables, which can be used to traverse subfolders, concatenate file names and so on. For more details, visit the ADF pipeline documentation: https://docs.microsoft.com/en-us/azure/data-factory/data-factory-functions-variables
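As an illustrative sketch (not part of the pipeline built in this article), a folderPath in the output dataset could be partitioned by the slice date like this; the Year, Month and Day variable names are arbitrary:

"typeProperties": {
    "folderPath": "ftpdata/{Year}/{Month}/{Day}",
    "partitionedBy": [
        { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
        { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
        { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }
    ]
}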

About deploying the pipeline

To deploy the Data Factory items created in the steps above, we need to right-click on the DataFactoryApp and click ‘Publish…’, which will bring us to the publish wizard. In the wizard, we need to specify the Data Factory we want to use and which files we want to copy.


After we go through the wizard, and assuming that we have configured everything properly, we will get the published ADF pipeline in Azure. We can then go to the portal and monitor its execution.

Conclusion

In this article, we covered setting up the Azure environment and running a custom .NET activity to download a file from an FTP server to blob storage in our Azure environment. We learned how to debug our code locally, and how to create and deploy the ADF pipeline from the Visual Studio environment. This is a very flexible way to extend the functionality of ADF, and it gives us the ability to write custom logic.

In the above example, the code is just a glimpse into the possibilities presented by custom .NET activities, and not production-ready code which performs magic.

It is much more important to focus on the building blocks of the process of composing the data integration: deciding what batch pool to use, how to scale it, what inputs and outputs to use, how to slice the extended properties dynamically, and finally how to deploy the pipeline automatically with a single click from Visual Studio.

