Wednesday, May 30, 2012

Extracting Data From Multiple Excel Files Having Multiple Data Sheets

Let us consider the case where there is number of MS-Excel files and each files have a number of sheets in them. And we have to export all these data to a common repository.
To accomplish this task we will use For Each Loop Container to load all these Excel sheet data.
Below are the steps that need to be followed to achieve this task through SSIS.Step 1: Create a new SSIS Package and Rename it to “Export Dynamic Excel File with Multiple Sheets.dtsx”


Step 2: Create Connection Manager for Source (Excel Connection Manager). Perform the following task:
  • Right click on the Connection Managers pane and choose “New Connection…” from context menu, Select “EXCEL” and click “Add” the “Excel Connection Manager” window will be opened.
  • On Excel Connection Settings Pane specify any one of the excel file name among all other files that need to be exported.
  • Check the checkbox that says “First rows has a column name
  • Rename this Connection manager as “Excel Connection Manager - Source”.

Fig 1: Excel File structure needs to be exported.
The below figure (Fig 2) shows the Data in Excel sheet of file AU.xls. AU.xls file has 2 sheets in it and similarly all other excel files in above figure have 2 or more sheets in them. So we have to export the data present in various sheets of above 4 Excel files shown in Fig 1.



Fig 2: Excel Sheet Structure

Step 3:
Create a variable named “CurrentExcelSheet” of String type with the default value “Queensland$” Which is one of the Sheet name in the above Excel File. By default Excel File has sheet1, sheet2 and sheet3 in it. So Provide the proper sheet name which must exists in the specified Connection Manger Excel File. In this sample example the Excel Manger is set to "AU.xls" and "Queensland
" is one of the sheets of Excel file “AU.xls”.  It contains 2 sheets in it and I have specified one of them. If specified Sheet name does not exist then it will through an error.
Step 4: Drag and drop “Foreach Loop Container” in Control Flow area. Set the properties as follows:

  • General: Change the Name to “Foreach Loop Container- Excel File Iterator”.
  • Collection: Set the “Enumerator” in Collection pane to Foreach File Enumerator. Refer to Figure 3.

    In addition to this make the following Enumerator Configuration settings:
    • Folder: Specify the Folder path where all the excel files are stored which has to be exported to the common repository. Here the folder path specified is same as shown in Figure 1 (address bar).
    • Files: Specify the file name as “*.xls” which will limits to excel files only and will not go to read any other file type even if they are present in the specified folder. We can specify "*.*" here, if we are very much sure that all the files must be of ".xls" type.


      Fig 3: Foreach File Enumerator Configuration
  • Variable Mappings: Map the file name to a particular variable. So that on each iteration different files will be assigned to this variable. So, in drop down click <New Variable…>, a new window “Add variable” will be open and declare the variable as “FileName”. In “Value” textbox specify (Fig: 5) the full path of one of the excel file that has to exported which will be the default value for this variable.



    Fig 4: Foreach loop Enumerator Variable Mapping





    Fig 5: Add Variable

And finally click “OK” to close the Add variable window. Variable Mapping tab will look like below after configuration. Set the Index of this variable to 0.






Fig 6: Variable Mapping
Click “OK” to close the “Foreach loop Editor”.
The added “Foreach Loop Container” Will iterate through each excel files placed in the folder (the folder path specified on Collection tab above). Now, On each iteration a new file name will be assigned to the variable but at the same time we need to iterate through all the Excel sheets present in the assigned Excel File to extract the data. So, to iterate through each excel Sheets of the selected Excel file we will take another “Foreach Loop Container”.
Step 5: Drag and drop another Foreach Loop Container and place it inside the Existing “Foreach Loop Container – Excel File Iterator”. The below figure shows 2 Foreach Loop Container paced one inside another. The Outer Foreach Loop Container will iterate through the Excel Files and the inner will iterate through the available Sheets of the selected Excel file hold by the outer Foreach Loop container.




Fig 7: Package Layout
Make the following configuration to the newly placed Foreach Loop Container:
  • General: Rename it to Foreach Loop Container - Iterate through Excel Sheet.
  • Collection: Set the “Enumerator” in Collection pane to Foreach ADO.NET Schema Rowset Enumerator. In addition to this make the following Enumerator Configuration:



Fig 8: Foreach ADO.NET Schema Rowset Enumerator
Connection: Create a new connection of type ADO.NET Connection. Click “New Connection…” to open the connection Manager window. 


Connection Manager: Make the following configuration on Connection Manager Window


Provider:  Microsoft Jet 4.0 OLEDB Provider as shown in figure 10.


DataBase file name: Specify the Database file name which is the path of the excel file. So specify the full path of one of the Excel file that has to be exported to common repository. Leave the user name and password textbox as it is.





Fig 9: Connection Manager
All: Click All (Figure 9) and Set the Extended properties as follows:





Fig 10: Advanced setting for Excel Connection
The value “Excel 8.0” for Extended Properties is specified to Excel 97-2003 files. If you are dealing with Excel 2007 file you need to use “Excel 12.0” in Extended properties.

Test the connection and close the connection manager. The connection Manager will look like figure 11.




Fig 11: Connection Manager Window

Rename the newly created ADO.NET connection as ADO.NET Connection for Excel. Make sure that this newly created ADO.NET Connection for Excel is selected in the Connection dropdown of Collection Tab shown in figure 8.

Schema: In Schema dropdown select Table.

  • Variable Mappings: We have created the variable in Step3 CurrentExcelSheet. Set variable name as CurrentExcelSheet and under index mention 2. 2 here hold the index value which will hold the table name so in case of excel it will be the Excel Sheet name. The Excel Sheet will be assigned to this variable.





    Fig 12: Foreach Loop Variable Mapping

Step 6: The “Excel Connection manager – Source” created in Step2 has to me modified as we have to iterate through each of the excel files specified in the folder. To make this connection dynamic go to the Properties Window of the “Excel Connection Manger – Source” and select “Expression” (shown in figure 13) and on the button click to open Property Expression Editor. Under "Property" dropdown choose “ConnectionString” and click the ellipse Button right of the “Expression” textbox to write the expression for it.



Fig 13: Property Expression
An Expression Builder window will be open to write the expression. It has 3 sections in it, the first section specifies the Variables which contains the System as well User defined variable and the second section contains inbuilt functions. The third section says Expressions where we write the Expression. So under this textbox write the following the expression for Connectionstring.
"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + @[User::FileName] +";Extended Properties=\"Excel 8.0;HDR=YES\";"
Click “Evaluate Expression” button to evaluate the expression written for the ConnectionString. Click “OK” to close the Expression window.




Fig 14: Expression Builder

Here the @[User::FileName] is the variable created to hold the name of Excel file. So on each iteration a new excel file will be assigned and hence it will hold dynamic value in it.
Step 7: We have already created the Connection manager (ADO.NET Connection for Excel) for ADO.Net while configuring the “Foreach Loop Container - Iterate through Excel Sheet”. Remember that we have specified the Physical path for the “ADO.Net Connection for Excel” connection manager. Since we have to loop through the excel sheets of the select excel file we have to make the connection dynamic so that on each iteration a new Excel Sheet file is allocated to this connection manager.So we will make the dynamic connection for the “ADO.NET Connection for Excel”.  To do this we will repeat the same process that we have done in Step 6 for the dynamic connection string. This will also have the same line of Expression as:
"Provider=Microsoft.Jet.OLEDB.4.0;Data Source="+ @[User::FileName] +";Extended Properties=\"Excel 8.0;HDR=YES\";"
This will make the connection dynamic for ADO.NET Connection for Excel.
Step 8: Now Drag and Drop the Data Flow Task inside the “Foreach Loop Container - Iterate through Excel Sheet”. Double click Data Flow Task and will navigate to “Data Flow” tab.
Step 9:  Drag and drop the “Excel Source”. Make the configuration as specified in the below figure.
Data access mode- In Data access mode dropdown select Table name or view name variable as shown in Figure 15. Since on each iteration the table name (here Excel sheet) is being assigned from a variable.


Variable name- Specify the name of variable which holds the name of Excel sheets and not the Excel file names.





Fig 15: Excel Source Editor

Click on the “Columns” tab to check whether the Columns Names specified are reflected or not. Click “OK” to close this window.
The source connection manager will pull the data from the Excel source that we have specified earlier. Now we will take another excel file as destination where all the data from various excel files having multiple sheets will be pushed into.
Step 10:  Now we are going to hold all the data being extracted from Source to another Excel file. So we have to create a separate Excel Connection manager for the destination and name it to “Excel Connection Manager”.
Step 11:  Drag and Drop “Excel Destination” on Data Flow. Extend the Green arrow of “Excel Source” created in Step 9 and attach to the “Excel Destination”. Make following Configuration:

Data access modeRemember here Data access mode is set to Table or View. Here the data is being sent to a particular table which will hold the data from various sheets of excel files.





Fig 16: Excel Destination Editor
  • Connection Manager: Specify the Destination connection manager create din step 10 under OLEDB Connection Manager.
    • Data Access mode: Set to Table or View
    • Name of the Excel Sheet: Click on New to create new table or select existing table from dropdown.



      Fig 17: Create Table
  • Mappings: Map the Columns of source to Destination.
Step 12: Execute the package.



Fig 18: Package Execution
Conclusion:  We have extracted the data from the excel sheet and placed them to the excel destination as final data. We can take different destination to hold the data, only we need to change the destination connection String. So, you can play with the data stored in the number of excel files having excel sheets in it and see how it works.

No comments:

Post a Comment