PowerShell-Specific Code Smells: Building output using +=

Before I cover this specific code smell, I should probably explain one thing. The presence of a code smell doesn't necessarily mean that the code in question isn't functional. In the example I gave last time (the extra-long method), there's no reason to think a method doesn't work just because it's a thousand lines long. There are lots of examples of code that isn't optimally written but works fine nonetheless. The focus here is that you're causing more work: either up-front work, because the code is longer or more complicated than necessary, or later on, when someone (maybe you?) needs to maintain it.

With that said, we should talk about aggregating output into a collection with the += compound assignment operator. This pattern is so common in other programming languages that it's a hard habit to break in PowerShell, but there are some good reasons to avoid it. To help explain what I mean, let's look at some sample code.


function get-sqlservices {
    param($computers)

    foreach ($computer in $computers){
        $output += get-wmiobject -class Win32_Service -filter "Name like 'SQL%'" -computername $computer
    }

    return $output
}
$mycomputers = 'localhost','127.0.0.1',$env:COMPUTERNAME

measure-command {
    get-sqlservices -computers $mycomputers | select -first 1
}

Before we discuss this code, let me be clear: this is not great code, for several reasons. For the purposes of discussion, though, let's just look at how the output is handled. As I mentioned, this is how you'd do something like this in most programming languages, and it works fine. On my laptop it ran in 723 milliseconds. If we change the list of computers to a longer list, it takes considerably longer:


$mycomputers = ('localhost','127.0.0.1',$env:COMPUTERNAME) * 100

Days : 0
Hours : 0
Minutes : 0
Seconds : 51
Milliseconds : 486
Ticks : 514863437
TotalDays : 0.000595906755787037
TotalHours : 0.0143017621388889
TotalMinutes : 0.858105728333333
TotalSeconds : 51.4863437
TotalMilliseconds : 51486.3437

Changing the function to send the output to the pipeline looks like this:


function get-sqlservices2 {
    param($computers)

    foreach ($computer in $computers){
        get-wmiobject -class Win32_Service -filter "Name like 'SQL%'" -computername $computer
    }
}
$mycomputers = ('localhost','127.0.0.1',$env:COMPUTERNAME) * 100

measure-command {
    get-sqlservices2 -computers $mycomputers | select -first 1
}

Days : 0
Hours : 0
Minutes : 0
Seconds : 0
Milliseconds : 478
Ticks : 4782609
TotalDays : 5.53542708333333E-06
TotalHours : 0.00013285025
TotalMinutes : 0.007971015
TotalSeconds : 0.4782609
TotalMilliseconds : 478.2609

The code doesn't look much different. The only changes are that we're no longer assigning the output of the get-wmiobject cmdlet to anything, and there's no explicit return. This is a point of confusion for most people who come to PowerShell from a traditional imperative language (C#, Java, VB, etc.). In a PowerShell script, any value that isn't "captured", either by assigning it to a variable or piping it somewhere, is added to the output pipeline. The "return value" of the function is the combination of all such values plus the value in a return statement (if present). So in this case, the output of the new function is the same as the output of the first one; changing it to use the pipeline didn't change the result at all.

So why is the original version considered a code smell? Because the second script runs faster than the first did. In fact, it runs faster with 300 computers (100 copies of the list of 3) than the first did with 3 computers. Why is it so much faster? In PowerShell 3.0, the implementation of select-object was changed to stop the pipeline once it has received the number of objects requested with the -first parameter. In other words, even though we passed 300 servers to the function, it stopped after it got the first result back from get-wmiobject for the first server.
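To see the implicit-output rule in isolation, here's a minimal function of my own (an illustration, not code from the examples above):

```powershell
function get-numbers {
    1          # not captured by anything, so it goes to the output pipeline
    $x = 2     # captured by an assignment, so it does NOT become output
    3          # also goes to the output pipeline
    return 4   # return adds its value to the output and exits the function
}

get-numbers   # emits 1, 3, 4
```

The function's "return value" is everything it wrote to the pipeline along the way, not just the value after the return keyword.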
You're not always going to be using -first, but even when you're not, values in the pipeline are available to downstream cmdlets before the function finishes (as long as you don't use +=). If you're simply sending the output to the console, you'll begin to see results immediately rather than having to wait for the whole run. Another issue arises when your aggregating function throws an exception before it's done: if you never reached the return statement, you won't see any results at all, whereas seeing the results up to the point of the error will probably help you track down where the error occurred.

What if there were thousands of servers (or your dataset were considerably larger for some other reason)? Your process would eat memory as it built a huge collection; with pipeline output there's no reason for the process to use much memory at all. Finally, with pipeline output there's one less thing to keep track of. One less variable means one less place to make a mistake (accidentally using = at some point instead of +=, misspelling the variable name, etc.).
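Sometimes you genuinely do need the whole collection in a variable (say, to sort it before emitting it). Even then, you can avoid += by assigning the loop's entire pipeline output in one step. This is a sketch of that alternative (get-sqlservices3 is my own hypothetical variation, not from the examples above):

```powershell
function get-sqlservices3 {
    param($computers)

    # A foreach statement used on the right side of an assignment collects
    # everything the loop body writes to the pipeline into $output in one
    # step, instead of rebuilding an array on every iteration the way += does.
    $output = foreach ($computer in $computers){
        get-wmiobject -class Win32_Service -filter "Name like 'SQL%'" -computername $computer
    }

    $output | sort-object -property Name
}
```

The reason this matters: += on an array allocates a new, larger array and copies the old contents every time it runs, so the aggregation itself gets slower as the collection grows, on top of the pipeline-streaming benefits lost.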

I hope you can see that with PowerShell, following this common pattern is not a good thing.

Let me know what you think.

Mike