Before I cover this specific code smell, I should probably explain one thing. The presence of code smells doesn’t necessarily mean that the code in question isn’t functional. In the example I gave last time (the extra-long method), there’s no reason to think that a method doesn’t work just because it’s a thousand lines long. There’s lots of code out there that isn’t optimally written but works fine nonetheless. The focus here is that you’re causing more work: either up-front work, in that the code is longer or more complicated than necessary, or work later on, when someone (maybe you?) needs to maintain the code.
With that said, we should talk about aggregating output using a collection object and the += compound assignment operator. This is such a common pattern in programming languages that it’s a hard thing not to do in PowerShell, but there are some good reasons not to. To help understand what I mean, let’s look at some sample code.
function get-sqlservices {
    param($computers)
    foreach ($computer in $computers){
        $output += get-wmiobject -class Win32_Service -filter "Name like 'SQL%'" -computername $computer
    }
    return $output
}

$mycomputers = 'localhost','127.0.0.1',$env:COMPUTERNAME
measure-command { get-sqlservices -computers $mycomputers | select -first 1 }
Before we discuss this code, let me be clear: this is not great code, for several reasons. For the purposes of discussion, though, let’s just look at how the output is handled. As I mentioned, this is how you’d do something like this in most programming languages, and it works fine. On my laptop it ran in 723 milliseconds. If we change the list of computers to a longer one, it takes considerably longer:
$mycomputers=('localhost','127.0.0.1',$env:COMPUTERNAME) * 100
Days : 0
Hours : 0
Minutes : 0
Seconds : 51
Milliseconds : 486
Ticks : 514863437
TotalDays : 0.000595906755787037
TotalHours : 0.0143017621388889
TotalMinutes : 0.858105728333333
TotalSeconds : 51.4863437
TotalMilliseconds : 51486.3437
Changing the function to send the output to the pipeline looks like this:
function get-sqlservices2 {
    param($computers)
    foreach ($computer in $computers){
        get-wmiobject -class Win32_Service -filter "Name like 'SQL%'" -computername $computer
    }
}

$mycomputers = ('localhost','127.0.0.1',$env:COMPUTERNAME) * 100
measure-command { get-sqlservices2 -computers $mycomputers | select -first 1 }
Days : 0
Hours : 0
Minutes : 0
Seconds : 0
Milliseconds : 478
Ticks : 4782609
TotalDays : 5.53542708333333E-06
TotalHours : 0.00013285025
TotalMinutes : 0.007971015
TotalSeconds : 0.4782609
TotalMilliseconds : 478.2609
The code doesn’t look much different. The only changes are that we’re no longer assigning the output of the get-wmiobject cmdlet to anything and there’s no explicit return. This is a point of confusion for most people who come to PowerShell from a traditional imperative language (C#, Java, VB, etc.). In a PowerShell script, any value that isn’t “captured”, either by assigning it to a variable or piping it somewhere, is added to the output pipeline. The “return value” of the function is the combination of all such values plus the value in a return statement (if present). So in this case, the output of the new function is the same as the output of the first; changing it to use the pipeline didn’t change the value at all.

So why is the += pattern considered a code smell? The reason is that the second script runs faster than the first did. In fact, it runs faster with 300 computers (100 copies of the list of 3) than the first did with 3 computers. Why is it so much faster? In PowerShell 3.0, the implementation of select-object was changed to stop the pipeline once it has received the number of objects requested with the -first parameter. In other words, even though we passed 300 servers to the function, it stopped after it got the first result back from get-wmiobject on the first server.
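To make that concrete, here’s a minimal sketch (the function names are mine, not from the code above) of both behaviors: uncaptured values become part of a function’s output, and select -first stops a streaming function early:

function get-numbers {
    'one'              # not captured, so it goes to the output pipeline
    $two = 'two'       # captured by assignment, so it is not output
    'three'
}
get-numbers            # emits: one, three

function get-slowly {
    foreach ($i in 1..100) {
        start-sleep -milliseconds 100
        $i             # streamed to the pipeline immediately
    }
}
# On PowerShell 3.0+, -first stops the upstream function after one object:
measure-command { get-slowly | select -first 1 }   # roughly 0.1 seconds, not 10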
You’re not always going to be using -first, but even when you’re not, values in the pipeline are available to downstream cmdlets before the function finishes (which they aren’t if you use +=). If you’re simply sending the output to the console, you will begin to see results immediately rather than having to wait. Another issue arises when your aggregating function throws an exception before it’s done: since you never reached the return statement, you won’t see any results at all. Being able to see the results up to the point of the error will probably help you track down where the error occurred. And what if there were thousands of servers (or your dataset were considerably larger for some other reason)? Your process would eat memory as it built a huge collection, whereas with pipeline output there’s no reason for the process to use much memory at all. Finally, with pipeline output there’s one less thing to keep track of. One less variable means one less place to make a mistake (accidentally using = at some point instead of +=, misspelling the variable name, etc.).
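Here’s a small illustration of the error-handling point (the functions are hypothetical, just to show the behavior):

function get-streamednumbers {
    foreach ($i in 1..5) {
        if ($i -eq 4) { throw "failed at $i" }
        $i             # already in the pipeline when the error hits
    }
}
get-streamednumbers    # shows 1, 2, 3, then the error

function get-aggregatednumbers {
    foreach ($i in 1..5) {
        if ($i -eq 4) { throw "failed at $i" }
        $output += $i
    }
    return $output     # never reached
}
get-aggregatednumbers  # shows only the error; 1, 2, 3 are lost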
I hope you can see that with PowerShell, following this common pattern is not a good thing.
Let me know what you think.
Mike
Honestly, I think there are times when building output that way makes more sense, such as when you’re building an automated script. The script itself should be built to capture and report on the errors it runs into, and you won’t be calling it with “select -first 1” (except perhaps during testing, for which you could write a limited-scope test so it doesn’t take long anyway).
As for building functions and tools for the command line, I agree, though once you get into more complex output there needs to be a good replacement. For instance, when building output dynamically from multiple sources, you should be creating an output object for each input and writing it out as it is completed.
I do feel that you should control what gets returned rather than letting the shell magic take care of it for you. Using “return” for that, though, is not the way, since it has to return everything at once. The proper way to control the output is to create an object, assign it properties, and write it out with Write-Output, something like the sketch below.
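Something like this minimal sketch is what I mean ($computers and the property names here are just illustrative):

foreach ($computer in $computers) {
    $obj = new-object PSObject
    $obj | add-member -membertype NoteProperty -name ComputerName -value $computer
    $obj | add-member -membertype NoteProperty -name Checked -value (get-date)
    write-output $obj    # emit each result as soon as it’s complete
}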
A couple of points. First, I don’t see any difference between writing tools for the command line and writing them for automated scripts. I try to write all of my tools the same way and can use them from either place.
Second, using “return” doesn’t return everything at once. It simply adds whatever value you “returned” to the output stream and exits the function. “return X” is indistinguishable from “X; return”.
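A quick way to convince yourself (the sample functions here are just for illustration):

function test-return   { return 1,2,3 }
function test-implicit { 1,2,3; return }
(test-return | measure-object).Count      # 3
(test-implicit | measure-object).Count    # 3
# Both functions stream the same three objects to the pipeline.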
I’ll admit, I’ve never actually used return in PowerShell, as I learned different (and I believe better) ways to do it before learning that it exists. My apologies if I’m misrepresenting what it does.
As for the automated script, I’m thinking more of the “control” part of the script that executes commands and collects output. Maybe I’m just doing it badly, but that’s how I’ve learned to do them. There’s never anything wrong with writing things to be reused, though.
I don’t think I said that quite right, so let me clarify: I believe everything should be written for reuse and as generalized as possible. Something I’m having an automated script do, however, may never have any use on the command line by itself.