Data Profiling with Windows PowerShell

A customer asked me the other day about a method to do some data profiling against a file system.  So I thought I would share the request, my suggestion, and the little PowerShell script I crafted to do some data profiling.  The request read as follows:  “Do you know of any tools that will list all files in a directory, and all subs, and provide attributes like filename, path, owner, create date, modify date, last access date, and maybe some other attributes in report format?”

My recommendation was a commercial product called TreeSize Professional by JAM Software.  The product license costs roughly $50 and is worth every penny: scan speeds are good, UNC paths are supported, and the reporting is intuitive.  Overall, an excellent product.

As an alternative, below is a quick PowerShell script (also attached to this post as data_profile.ps1) that creates a CSV file with data profiling information.  Once the CSV file is created, it can be opened in Excel (or your spreadsheet tool of choice) or imported into a database and manipulated.
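In case the attachment is unavailable, here is a minimal sketch of the idea — a recursive Get-ChildItem scan collected into an array and then written out with Export-Csv.  The paths come from the description below; the exact columns in the attached script may differ:

```powershell
# Sketch: scan c:\files recursively, build one record per file,
# collect everything in an array, then write the array to results.csv.
$results = @()
foreach ($file in (Get-ChildItem -Path "c:\files" -Recurse)) {
    $acl = Get-Acl $file.FullName
    $results += New-Object PSObject -Property @{
        FullName       = $file.FullName
        Owner          = $acl.Owner
        Length         = $file.Length
        CreationTime   = $file.CreationTime
        LastWriteTime  = $file.LastWriteTime
        LastAccessTime = $file.LastAccessTime
    }
}
# Export-Csv writes the whole array in one shot (no append support).
$results | Export-Csv -Path "results.csv" -NoTypeInformation
```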

The script scans all files recursively starting at c:\files and writes the results to results.csv.  One thing to note is that the scan stores all data in an array in memory; this is because the PowerShell Export-Csv cmdlet does not support appending to a CSV file (you have to wonder what Microsoft talks about in design meetings).  I will likely create a version of the script that uses the Out-File cmdlet to write each row to the CSV file as the scan happens, rather than storing everything in memory until the scan completes and then writing the entire array to the report.csv file.  The goal is to reduce the memory footprint during large scans.
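The streaming variant would look roughly like this — each record is converted to CSV text with ConvertTo-Csv and appended with Out-File as the scan runs, so only one row is held in memory at a time (the column list here is illustrative):

```powershell
# Sketch of the streaming variant: write each row as it is produced
# instead of accumulating an array.
$report = "results.csv"
$first  = $true
foreach ($file in (Get-ChildItem -Path "c:\files" -Recurse)) {
    $acl = Get-Acl $file.FullName
    $row = New-Object PSObject -Property @{
        FullName      = $file.FullName
        Owner         = $acl.Owner
        Length        = $file.Length
        LastWriteTime = $file.LastWriteTime
    }
    # ConvertTo-Csv on a single object yields two lines:
    # [0] = header row, [1] = data row.  Keep the header only once.
    $csv = $row | ConvertTo-Csv -NoTypeInformation
    if ($first) { $csv    | Out-File $report;         $first = $false }
    else        { $csv[1] | Out-File $report -Append }
}
```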

The output of this script will be similar to the following:


2 thoughts on “Data Profiling with Windows PowerShell”

  1. Nice work. For PowerShell V2 you can work it like this.

    Function Get-FileDetails ($target) {
        Get-ChildItem $target -Recurse |
            Select-Object FullName, Root, Directory, Parent, Name, Extension,
                PSIsContainer, IsReadOnly, Length, CreationTime,
                LastAccessTime, LastWriteTime, Attributes
    }

    foreach ($file in (Get-FileDetails C:\temp\ConsoleApplication2)) {
        $acl = Get-Acl $file.FullName |
            Select-Object Path, Owner, AccessToString, Group

        $properties = @{}
        $properties.Root           = $file.Root
        $properties.Directory      = $file.Directory
        $properties.Parent         = $file.Parent
        $properties.Name           = $file.Name
        $properties.Extension      = $file.Extension
        $properties.IsDIR          = $file.PSIsContainer
        $properties.IsReadOnly     = $file.IsReadOnly
        $properties.Size           = $file.Length
        $properties.CreationTime   = $file.CreationTime
        $properties.LastAccessTime = $file.LastAccessTime
        $properties.LastWriteTime  = $file.LastWriteTime
        $properties.Attributes     = $file.Attributes
        $properties.Owner          = $acl.Owner
        $properties.AccessToString = $acl.AccessToString
        $properties.Group          = $acl.Group

        New-Object PSObject -Property $properties
    }

  2. I have not run the PS2 code, but does it output the results to a CSV file?  It does not look like it.  The problem I am trying to solve is the memory footprint when the scan runs: using an array to store the output and then writing the array to a CSV file.  I would like to write each row as it is produced and remove the use of the array.  The issue I have is that the Export-Csv cmdlet does not support appending to a CSV (I don't think this is supported on PS2 either).  I did find a replacement Export-CSV function here http://dmitrysotnikov.wordpress.com/2010/01/19/export-csv-append/ which says it supports an -Append flag, but I have not had a chance to try it yet.  I tried using the Out-File cmdlet in the PS1 script, but the formatting was a little more work than I bargained for.
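     For what it is worth, later PowerShell releases (3.0 and up) added a native -Append switch to Export-Csv, which makes the streaming version trivial — no replacement function or array needed:

     ```powershell
     # Requires PowerShell 3.0+; Export-Csv -Append handles the header
     # on the first write and appends data rows thereafter.
     Get-ChildItem -Path "c:\files" -Recurse |
         Select-Object FullName, Length, LastWriteTime |
         Export-Csv -Path "results.csv" -NoTypeInformation -Append
     ```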
