Full Text

Research Article

Managed File Transfer Protocal


Abstract

This paper is to let new users to get familiar with Power BI and be able to create jobs in GoAnywhere.

 

Keywords: GoAnywhere is centralize and secure file transfers tool with ease, streamline processes. It provides centralized control and improves the quality of your file transfers and help the organization to comply with data security policies and regulations. It is a managed file transfer and encryption tool used by Mid-sized companies, large enterprises, and government entities to protect sensitive data.

 

1. Introduction

This tool is used for any function to be performed with files. A Business Process which involves Ftp’ing of files to an outside server or uploading of a document onto another server, Encrypt or Decrypt a file, create a trigger file, we make use of GoAnywhere.

 

GoAnywhere (GA) practically has interrelationship with across applications being an important tool to deal with file operations. These jobs either be scheduled to run at a particular time or could be called on to run during a Informatica Feed execution.

 

Since the Job deals with files movement across applications in the system, it holds the critical place in the business.

 

2. Functional Overview

Resources are the names and connection properties of the servers, and other data sources, which GoAnywhere can interact with. Users with the Resource Manager role can create and edit Resources. The defined Resources can be used within GoAnywhere Projects by choosing the Resource names from drop-down lists.

Listed below are the Resource types that have been defined in GoAnywhere.

 

Azure Blob Storage: Azure Blob Storage provides organizations with affordable and scalable cloud storage provided by Microsoft. GoAnywhere supports Block type blob storage.

 

Network Shares: The names and connection properties of shared file locations on a network to which GoAnywhere can connect for accessing files.

 

Database Servers: The names and connection properties of the database servers to which GoAnywhere can connect. Database servers supported include DB2 for IBM i (iSeries), DB2, Oracle, SQL Server 2000 and later, Sybase, Informix, PostgreSQL, and MySQL.

 

FTP Servers: The names and connection properties of the FTP servers to which GoAnywhere can connect for sending and receiving files.

 

FTPS Servers: The names and connection properties of the FTPS servers (FTP over SSL) to which GoAnywhere can connect for sending and receiving files.

SSH Servers (SFTP/SCP/SSH): The names and connection properties of the SFTP servers (FTP over SSH) to which GoAnywhere can connect for sending and receiving files.

 

SMTP Servers: The names and connection properties of the SMTP servers (Mail Servers) to which GoAnywhere can connect for sending emails.

 

OpenPGP Key Rings: The names of the OpenPGP Key Rings that can be used in GoAnywhere to encrypt and decrypt files using the OpenPGP standard.

 

Permissions (authorities) for a Resource can be granted to individual users and Groups of Users. The following three types of permissions can be granted:

 

Permission Description: Read - Allows any user with the Resource Manager role to view the settings for the Resource.

 

Write: Allows any user with the Resource Manager role to change the Resource settings or permissions, or to delete the Resource.

 

Use: Allows users to utilize the Resource when executing a Project.

 

For instance, you may have a FTP server resource that only certain users should be able to utilize (connect to). If you additionally do not want these users to change the settings on the FTP server, then you would give the users the permission of Use only to that Resource.

 

Module: A Module is a logical grouping of one or more Tasks. For instance, a module may be defined with three Tasks to be executed in sequential order. The first Task in the Module may be to create a Work Space, second task calculate end time and third task to connect to source path and create a file list. Multiple modules can be defined in a Project. A Module can pass control to another Module based on certain conditions.

 

For instance, you could have a main Module to perform a series of tasks. If any errors are encountered in the main Module, you could have control passed to another module that sends an error notification to an email address or perform some other Tasks. The Call Module task can execute another Module in a Project to perform a sub-routine of tasks based on specific criteria.

 

IF Condition: An IF Condition controls if a block of tasks will run if a condition is met. The IF Condition is like a Yes/No decision point in a process flow chart.

 

Loops: Loops are complex components that repeat a set of Tasks on the data or files produced in a Project.

For instance, the data in a spreadsheet needs to be read into a database. A Loop contains the Tasks that will read a row of data, evaluate and process it and then iterate to the next row of data and repeat the Loop.

 

Task: A Task is a discrete business process to perform. For instance, a Task may write data to an XML document, send an email, import data from an Excel file, or FTP a file. GoAnywhere includes dozens of different Tasks to choose from when building a Project. There is no logical limit to the number of Tasks that can be defined in a Project.

 

Element: A Task is made up of one or more Elements, which describes the types of work to perform within the Task. For example, the FTP task has an Element to put a file on a FTP server, an Element to get a file from a FTP server, an Element to rename a file on a FTP server, and so on. As another example, the Send E-Mail task has an Element to indicate the attachments to send.

 

 

Figure 1: Flow.

 

3. Technical Overview

We have within the GA GUI, which are Resource, Projects, scheduler, Active Jobs, Completed Jobs, Users, and groups. The Development team has access to Projects, scheduler, Active Jobs, Completed Jobs.

 

Resources are the section wherein the resource utilized within the project are being set up. Projects are the section wherein you will be able to find all the existing projects we have. This is the section where we develop any new projects required to support the business. Active and Completed section are the areas where we can view the jobs that are in execution or the once that have completed their execution.

 

 

Figure 2: Dashboard.

 

Pub/Sub Pattern:

Publication Process: The publication process includes retrieving the data from the publisher, running the publication mapping, and writing the data to the relevant topic in the publication repository. After the publication process ends, each subscriber consumes the published data according to the schedule and the filter that you define when you create the subscription. In addition, in our process after every publication a Mapping task subscription will be triggered to load the Topic table’s data into Datastore.

 

**As part of this solution to increase reusability and to mitigate any future risks if we add/remove any attributes – Mappings are considered instead of DSS tasks.

 

The following flow shows the main stages of the publication process for publications:

 

 

 
Figure 3: publication process.

 

Subscription Process: The subscription process includes retrieving the required data from the Data Integration Hub publication repository, running the subscription mapping, and writing the data to one or more subscriber targets. Data Integration Hub keeps the data in the publication repository until the retention period of the topic expires.

 

As soon as all the Topics and topic tables are published, all the data from these topics and topic tables will be subscribed into the Data Store, a persistent database on Azure.

 

For all the applications that are being configured to run the APIs for Subscription, the tasks will retrieve data from Datastore but not from Data Hub.

 

The following image shows the main stages of the subscription process for each subscription:

 

 


 
Figure 4: Processes for Subscription.

 

Integration Tasks without Topics/Topic Tables (ETL Pattern): Informatica Data Integration or Application integration tasks (mappings, task flows, etc.) will be developed and configured to run on the source data and the target data is staged in the Data Store Stage database on Azure for further processing. This route will be taken for application data which are not used downstream, and which are only for application specific.

 

The following flow shows the main stages of how the DT Data Store Stage is integrated within the process:

 

 

 Figure 5: Processes for topic.

 

Integration Tasks for Real Time applications: For all the real time feeds, Informatica Data Integration or Application integration tasks (mappings, task flows, etc.) are considered, but Power Exchange CDC comes into consideration as all the real time feeds are dependent on the condensed files for capturing few attributes, which will be consumed by other applications.

 

4. Handling CDC & Full Load Process

All CDC logic will be built from Source to CDC Topics in their Publication.

Bound Subscription will subscribe that data through a Mapping task from Topics to Datastore

Bound Subscription will subscribe the data through mapping task from Topics to the Applications.

A Publisher for Full load will be loaded from Datastore to Full Load Topics

Unbound Subscription will subscribe that data from Topics.

 

 

 Figure 6: Processes for full load topics.

 

5. Initialize and Generate Automatic Parameter File

This task takes the process name and application name parameter and executes the batch job to call the stored procedure, which in turns does the following activities.

While inserting a record in Process table, it checks for the below conditions and picks up the respective values to insert:

Checks for IsOverrideETLProcessID (default is 0), for each task from TaskParameter table.

If IsOverrideETLProcessID=0, then Identity column in Process Table: ProcessID= ETLProcessID in Process table.

If IsOverrideETLProcessID>0, it picks up the mentioned ETLProcessID from the TaskParameter table.

Checks for the override parameter value (IsOverride) for each task from TaskParameter table (default is always 0, 1 means override)

If IsOverride=0, then the latest TaskStartDate for that Task from Task Table where Status=’Success’ is updated into TaskParameter Table as SourceExtractStartDate. And Then, SourceExtractEndDate=getdate ().

If IsOverride=1, it picks up the mentioned SourceExtractStartDate and SourceExtractEndDate from the TaskParameter table.

Generates the dynamic parameter file by merging static file and the output of TaskParameter table

For PUB/SUB MCT’s, the variables are set at the mapping level itself. (SourceExtractStartDate, SourceExtractEndDate, ModifiedDate, ETLProcessID)

 

6. Delta Detection

When the data in a source system is frequently updated, it is necessary to capture the updated information to the target extracts. However, due to high volume and load window, it is desirable to consider only the updated delta information, rather than reloading the entire source table. Usually, there is a Modified timestamp column in the source table. This column can be used to filter the source records, based on the last source extract end time of that task.

 

This is implemented using a common framework component to generate a dynamic parameter file for each task flow, based on the previous successful execution status in the Task table, Stored procedure picks the SourceExtractEndDate for that task and updates the TaskParameter table SourceExtractStartDate = Task. SourceextractendDate and SourceExtractEndDate = GETDATE.

 

7. Folder Structure

The below Project structure facilitates the coordination and implementation of data hub using different services in IICS. Its main reason is to create an environment that fosters interactions among the team members with a minimum number of disruptions, overlaps and conflict.

The structure of the directories created in the repositories always follows the below hierarchy.

Application Name

API (Process, Process Objects, etc.)

ETL (Mappings, Mapping tasks, DSS tasks, etc.)

Orchestration

Publisher (All mappings that are in the Publications for the application)

Subscriber (All mappings that are in the Subscriptions for the application)

 

 


 
Figure 7: Folder structure

 

Folder structure in integration hub: In Integration hub, assets like Application, Topic, Publisher, and subscribers are managed.

 

Folders in CIH are predefined based on Applications, Topics, Publications and Subscriptions. There is no provision to update folder structure manually.

 

8. Version Control

Version control is not available in Informatica Cloud. All developers must save their changes religiously and always make sure to take backups regularly.

 

Run team is working on a solution for Version Control which is out of scope for this project.

The below is the proposed solution for code management. Even though it is not in scope of this project, below is the future roadmap for handling code version management in IICS as the tool doesn’t provide any version control.

 

 

 Figure 8: Deployment flow.

 

9. References

        1. https://www.informatica.com/blogs/welcome-to-informatica-intelligent-cloud-services.html
        2. https://now.informatica.com/IICS-Cloud-Data-Integration-Services-onDemand.html