Text processing one-liners: Ruby vs. Awk
Since my review of a book about Awk, I've been thinking a lot about text processing in the Unix stream-oriented workflow. Before learning Awk, I used Sed for easy text substitutions, yet my go-to language for text processing was Ruby: I have written many small programs that simply iterate over the lines of a stream, processing each in turn. However, hailing from Perl, which hailed from Awk, Ruby is fully featured for text processing directly from the command line. As I was checking out its array of command-line options, the methods of its Kernel module and its multitude of predefined global variables, I wondered whether those Awk one-liner programs would be just as easy and concise in Ruby.
So I checked, by translating all of Eric Pement's Awk one-liners into Ruby one-liners. I came away with multiple tricks and observations about the respective behaviors of Ruby and Awk, so let me show them to you.
By the way, I know that most people would rather use Perl than Ruby for command-line text processing with the power of a general-purpose programming language. Well, I don't know any Perl; please don't tell anyone. Moreover, this subject will be covered in an upcoming e-book by Peteris Krumins, the author of two e-books that explain one-liner programs in Awk and Sed, respectively. I wish to add a bit to this bright fellow's wisdom, so let us carry on about Ruby.
While Awk has few command-line options and a single invocation profile, Ruby has multiple invocation types, each of which determines the structure of the text processing program the one-liner can express. In Awk, all programs share the same structure: each line (each record, actually, but all of Eric Pement's one-liners divide a text file along line breaks) is split into fields and then processed by each program clause in succession, whenever the line or its fields satisfy the clause's condition. By default, a clause's condition is always true, so it is satisfied by all lines. Also by default, a clause's body prints the current line.
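To make this clause model concrete, here is a rough Ruby sketch of the loop Awk runs for us. The run_clauses method and its representation of clauses as pairs of lambdas are my own illustration, not how any Awk implementation actually works.

```ruby
require "stringio"

# Hypothetical model of the Awk main loop. Each clause is a
# [condition, action] pair of lambdas; nil stands for the default.
def run_clauses(input, clauses, out)
  input.each_line do |line|
    record = line.chomp                 # Awk drops the record separator
    fields = record.split               # and splits the record into fields
    clauses.each do |condition, action|
      # A missing condition is always true.
      next unless condition.nil? || condition.call(record, fields)
      if action
        action.call(record, fields, out)
      else
        out.puts record                 # a missing action prints the line
      end
    end
  end
end

clauses = [
  [->(record, _fields) { record.include?("b") }, nil],        # awk '/b/'
  [nil, ->(_record, fields, io) { io.puts fields.length }]    # awk '{ print NF }'
]
result = StringIO.new
run_clauses(StringIO.new("a b\nc\n"), clauses, result)
print result.string
```

Both clauses see every line, in order: the first prints only the line that contains a b, the second prints the field count of each line.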
In contrast, the basic structure of command-line Ruby programs depends on multiple invocation flags. The most basic invocation is
ruby -e '<PROGRAM>' file1 file2 ...
which simply runs PROGRAM. To make this form use the input files (or standard input, when there are none), one must resort to the readlines method, or to a while gets; ...; end loop. This invocation is useful when one wants to process the text as an array of lines (using readlines), either to reduce the text to a single piece of information or to deal with patterns that emerge from groups of lines. However, for line-oriented processing, the invocation
ruby -ne '<PROGRAM>' file1 file2 ...
is preferable, as it implies the while gets; ...; end loop form. This means that PROGRAM runs as if it were within this implicit loop. This invocation is perfect for text filtering (conditional printing), as it does not automatically print lines. However, for text transformation (e.g. substitution of some pattern), every line is printed after processing, so such programs would always end with a print statement. They are better run with the invocation
ruby -pe '<PROGRAM>' file1 file2 ...
which is the same as using the -n flag, but adds an implicit print at the end of each loop iteration.
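The two implicit loops can be spelled out as ordinary Ruby. This is only a sketch: the method names are mine, the pattern /foo/ and the upcase transformation are arbitrary examples, and a local variable line stands in for $_ so that the code can run outside a one-liner.

```ruby
require "stringio"

# Roughly what `ruby -ne 'print if ~/foo/'` amounts to: a read loop with
# no automatic printing, hence a filter.
def filter_like_n(input, out)
  while (line = input.gets)          # `line` plays the role of $_
    out.print line if line =~ /foo/
  end
end

# Roughly what `ruby -pe '$_.upcase!'` amounts to: the same loop, plus a
# print at the end of every iteration.
def transform_like_p(input, out)
  while (line = input.gets)
    line.upcase!
    out.print line                   # the implicit print added by -p
  end
end

out = StringIO.new
filter_like_n(StringIO.new("foo\nbar\nfood\n"), out)
print out.string                     # the bar line is dropped

out = StringIO.new
transform_like_p(StringIO.new("ab\ncd\n"), out)
print out.string                     # every line comes out, uppercased
```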
Division of lines into fields is one of the great features of Awk, and Ruby supports it too: add the -a flag to a line-processing invocation (-n or -p).
With this flag, each iteration of the implicit gets loop starts by splitting the line, and the resulting array of fields is stored in the global variable $F. Note that Ruby's field-related features are not as rich as Awk's. Rewriting a field in Awk (e.g. $2 = "something") automatically rewrites the full line $0 so it contains the new field value (and normalizes the spacing). In Ruby, replacing an element of the $F array does not rewrite $_. Therefore, field-oriented processing is typically done with the -n flag instead of the -p flag: when using a modified array of fields, the print statement must explicitly use $F instead of $_, so it cannot be implicit.
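Here is a sketch of what field replacement looks like once the implicit machinery is written out. The method name and the replacement value "X" are mine; fields stands in for $F and line for $_.

```ruby
require "stringio"

# Spelled-out sketch of: ruby -ane '$F[1] = "X"; puts $F.join(" ")'
def replace_second_field(input, out)
  while (line = input.gets)          # `line` plays the role of $_
    fields = line.split              # what -a stores in $F
    fields[1] = "X"                  # only the array changes; `line` does not
    out.puts fields.join(" ")        # so the fields are printed explicitly
  end
end

out = StringIO.new
replace_second_field(StringIO.new("a   b c\n"), out)
print out.string
```

Joining with a single space drops the extra blanks of the input, which mimics the spacing normalization Awk performs when a field is assigned.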
When comparing Awk and Ruby programs, one immediately notes that the two languages handle linebreaks (or, more generally, input record separators) very differently. Namely, while Awk removes these separators from the record $0, Ruby leaves them on. In Ruby, the platform's default linebreaks can be removed using the chomp method. Confusingly, this method has different semantics depending on whether it is called on a target of class String or without an explicit target: the former simply returns a linebreak-free copy of the target; the latter implicitly acts on $_, but it also assigns the copy back to $_ and thus has a side effect that the former does not have.
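The difference between the two chomp variants can be sketched as follows. A bare chomp only exists under -n or -p, where the Ruby documentation describes it as equivalent to $_ = $_.chomp, so the second method spells that assignment out; the method names are mine.

```ruby
# String#chomp returns a stripped copy; the receiver keeps its linebreak.
def chomp_with_target(line)
  copy = line.chomp
  [line, copy]
end

# Stand-in for a bare `chomp` in a -n or -p one-liner.
def chomp_without_target(line)
  $_ = line          # what the implicit gets does for each line
  $_ = $_.chomp      # the side effect: $_ itself loses its linebreak
  $_
end

p chomp_with_target("hello\n")      # prints ["hello\n", "hello"]
p chomp_without_target("hello\n")   # prints "hello"
```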
Keeping the linebreaks during processing can be either an advantage or an inconvenience, depending on the situation. For nondestructively testing whether the line is empty, given the semantics of Kernel#chomp (i.e. chomp without an explicit target), one must resort to the long-winded $_.chomp.empty? form, or to the less expressive and more computationally demanding regular expression form ~/^$/ (easily extended to ~/^\s*$/ to test for visibly empty lines, which contain only whitespace). Using the -a invocation flag, one might also use the form $F.empty?, but this still involves some hoop jumping and unnecessary computation (in this case, splitting the line although the fields are not needed). However, keeping the linebreaks on makes their processing local, whereas in Awk output linebreaks are set explicitly through the global variable ORS. It also eases printing the line with the print statement, given that in Ruby the default output linebreak used by print (global variable $\, equivalent to Awk's ORS) is nil, which resolves to the empty string. Which leads us to a discussion of...
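The empty-line tests above do not all answer the same question, which a small sketch makes visible. The method names are mine, and each takes the line as an argument where a one-liner would use $_ or $F.

```ruby
# $_.chomp.empty? : true only for a line holding nothing but its linebreak.
def empty_by_chomp?(line)
  line.chomp.empty?
end

# ~/^$/ : same answer, obtained through a regular expression match.
def empty_by_regexp?(line)
  line.match?(/^$/)
end

# ~/^\s*$/ : also accepts lines that contain only whitespace.
def blank_by_regexp?(line)
  line.match?(/^\s*$/)
end

# $F.empty? : splitting on whitespace leaves no fields for such lines either.
def blank_by_fields?(line)
  line.split.empty?
end

["\n", "  \n", "x\n"].each do |line|
  answers = [empty_by_chomp?(line), empty_by_regexp?(line),
             blank_by_regexp?(line), blank_by_fields?(line)]
  puts "#{line.inspect}: #{answers.join(' ')}"
end
```

Only the last two forms treat a whitespace-only line as empty, so $F.empty? is really a counterpart of the ~/^\s*$/ test rather than of ~/^$/.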
Both Ruby and Awk have a printf statement; the two have equivalent semantics and are similarly context-free. In addition, both have a print statement with equivalent semantics, but the two work in different contexts. In Awk, the print statement writes all its arguments (after proper conversion to strings using internal routines) separated with the value of variable OFS and ends the output with the value of variable ORS. Similarly, Ruby's print writes all its arguments (after proper conversion to strings using each argument's to_s method) separated with the value of global variable $, and ends the output with the value of global variable $\.
The difference lies in the default values of both pairs of variables: in Awk, OFS defaults to a single space and ORS to a newline, the same default values as FS and RS. In contrast, $, and $\ both default to nil, and nil.to_s returns the empty string. Field separation must therefore be set explicitly. As for line separation, when the last argument of print is the current line (which is the case when passing no argument, which implicitly prints the value of $_), this is of no matter, as the output naturally terminates with the line's original linebreak (unless method chomp was invoked before print, natch). However, when only printing a subset of the fields, or fields interspersed with new data, the value of $\ must also be set, which grows the one-liner.
However, Ruby offers the puts method as an alternative to print. This method writes each of its arguments (converted with to_s, with arrays flattened to one element per line) and follows it with a linebreak, unless the argument already ends with one. Hence, this is the method of choice to print a chomped line. In addition, Ruby has method Array#join to easily merge the contents of an array of strings (like $F) into a single string. Coupled with the puts method, this works around the annoyance of explicitly setting the output separator variables, especially when dealing with records composed of varying numbers of fields. When the output has a fixed number of fields, method printf often yields easier output control, as well as a shorter one-liner.
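A short sketch of the three output styles, assuming $, and $\ are left at their nil defaults. The captured helper is mine and only exists to collect what each call would write.

```ruby
require "stringio"

# Hypothetical helper: runs the block against an in-memory stream and
# returns everything that was written to it.
def captured
  out = StringIO.new
  yield out
  out.string
end

fields = "a b".split

p captured { |o| o.print fields[0], fields[1] }               # "ab"
p captured { |o| o.puts fields.join(" ") }                    # "a b\n"
p captured { |o| o.puts "a b\n" }                             # "a b\n"
p captured { |o| o.printf("%s:%s\n", fields[0], fields[1]) }  # "a:b\n"
```

With the default separators, print glues the fields together and adds no linebreak, while puts supplies the linebreak only when the string lacks one.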
Many one-liners require the storage of some intermediate result into a variable for accumulation or processing at a later time. Awk, tailor-made for line-oriented processing, scopes all variables as global: wherever a variable is set, it can be accessed later, be it in another clause or in the BEGIN or END block. In addition, variables need not be explicitly initialized. If first used in the context of additive accumulation, a variable is implicitly initialized to 0; if instead first used in the context of array element assignment, it is implicitly initialized to an empty array.
Ruby is rather a general-purpose language shoehorned into line-oriented processing. Such a language requires more rules for managing complexity, and these rules include local variable scope. An unadorned identifier is considered a local variable. Thus, if it is initialized in the BEGIN block of the program, its value is lost in the main body. Explicit global variables, which are prefixed with a $ sign, must be used to bypass the limitations imposed by local scoping. Moreover, all variables must be explicitly set to a value (thus initialized) before first usage as the target or argument of a method call. Therefore, Ruby one-liners require a BEGIN block more often than equivalent Awk programs, for variable initialization.
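As an illustration, counting the lines that match a pattern takes awk '/x/ { n++ } END { print n + 0 }' in Awk. A Ruby counterpart needs the explicit initialization, sketched below in script form; the method names and the global names $n and $never_initialized are mine.

```ruby
require "stringio"

# Spelled-out sketch of: ruby -ne 'BEGIN{$n=0}; $n+=1 if ~/x/; END{puts $n}'
def count_matching(input, pattern)
  $n = 0                             # BEGIN block: explicit initialization
  while (line = input.gets)
    $n += 1 if line =~ pattern       # main body, run once per line
  end
  $n                                 # the END block would `puts $n`
end

# Without the initialization, the first accumulation fails: an unset
# global reads as nil, and nil has no + method.
def accumulate_without_init
  $never_initialized += 1
rescue NoMethodError => e
  e.class
end

puts count_matching(StringIO.new("x\ny\nxx\n"), /x/)   # prints 2
puts accumulate_without_init                           # prints NoMethodError
```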
A compelling feature of Awk is its capacity for treating fields of a line as numbers, for line-oriented number crunching. This ability stems from the fact that common arithmetic operators indicate that fields are expected to be numbers, so Awk automatically turns the string fields into integers or floating-point approximations, on which many common numerical operations can be computed.
Ruby can also handle numbers, but not in the implicit way Awk does, because of its flexibility in overloading operators. For better and for worse, Ruby implements both arithmetic and string operations (concatenation and repetition, namely) with the same operator symbols. Hence, where in Awk two fields may be added together with the expression $1+$2, the equivalent Ruby form $F[0]+$F[1] results in the concatenation of the first two fields, not their numerical addition. Therefore, explicit conversion methods such as String#to_i and String#to_f are necessary for number crunching in Ruby one-liners, resulting in larger programs than the Awk equivalents.
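A minimal sketch of the difference, with a hypothetical add_fields method standing in for the body of a -a one-liner:

```ruby
# `fields` plays the role of $F for one input line.
def add_fields(line)
  fields = line.split
  {
    concatenated: fields[0] + fields[1],        # String#+ joins the strings
    summed: fields[0].to_f + fields[1].to_f     # explicit conversion first
  }
end

result = add_fields("3 4.5")
puts result[:concatenated]   # prints 34.5
puts result[:summed]         # prints 7.5
```

As in Awk, a field that does not look like a number converts to zero: "abc".to_i returns 0.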
Redirection forms for the print statements are a very strong suit of Awk. Using a statement of the form print ... > "output.txt" results in the automatic opening of file output.txt, as well as its automatic closing at the end of the execution. Redirecting to subprocesses is just as elegant.
The closest Ruby form is the open method, which yields IO objects with print, puts, printf and gets methods that act similarly to their counterparts without an explicit target. In addition, open handles both files and subprocess pipes. Where this approach falters, however, is in the absence of automatic opening or closing. For an Awk program with structure
awk 'COND { print > "output.txt" }'
the equivalent Ruby program is
ruby -ne 'BEGIN{$f=open("output.txt","w")}; $f.print if COND; END{$f.close}'
The manual file management in BEGIN and END blocks mires the one-liner. Shell redirection can be used instead in many instances, but the Awk redirection forms remain easier to deal with.
One last approach worth mentioning for output redirection uses the block variant of the open method. The program hinted at above could instead be written as
ruby -ne 'open("output.txt","a"){|f| f.print} if COND'
which is almost as terse as the Awk program. However, this implies that file output.txt is opened for appending and then closed every time the condition COND is satisfied. This is obviously more computationally demanding than the Awk program.
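The two redirection approaches can be compared side by side in script form. This is a sketch under a few assumptions of mine: the method names, the /keep/ pattern standing in for COND, File.open in place of the bare open, and a temporary directory so that nothing is left behind.

```ruby
require "stringio"
require "tmpdir"

# First approach: open once (BEGIN), write in the loop, close once (END).
def redirect_with_explicit_close(input, path, pattern)
  file = File.open(path, "w")
  while (line = input.gets)
    file.print line if line =~ pattern
  end
  file.close
end

# Second approach: reopen the file for appending on every matching line.
def redirect_with_append_block(input, path, pattern)
  while (line = input.gets)
    File.open(path, "a") { |f| f.print line } if line =~ pattern
  end
end

Dir.mktmpdir do |dir|
  text = "keep 1\ndrop\nkeep 2\n"
  first = File.join(dir, "first.txt")
  second = File.join(dir, "second.txt")
  redirect_with_explicit_close(StringIO.new(text), first, /keep/)
  redirect_with_append_block(StringIO.new(text), second, /keep/)
  puts File.read(first) == File.read(second)   # prints true
end
```

The results only agree when the output file does not exist beforehand: the "w" mode truncates it on every run, whereas the "a" mode keeps adding to whatever a previous run left there.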
Ruby starts shining with respect to Awk when text processing must involve additional libraries that extend the tools available. In Awk, anything that goes beyond file-like I/O and simple math must be piped in from or piped out to external programs. Welcome to Unix: Awk solves one simple problem expecting that if it is only part of the solution, the rest will be handled by other equally simple and specialized programs.
Ruby, however, runs in line with modern interpreted programming languages that label themselves as batteries included (this trope actually comes from Python, but it describes Perl and Ruby just as well). Want to get your input from a network connection? You can use Netcat (nc) and Awk, or just Ruby. Want to push the output to an e-mail and have it sent using Gmail/TLS with your Google credentials? You can use Awk and a painstakingly configured Sendmail, or you can use Ruby. Want to use a directory listing as input? You can use ls and Awk, or you can use Ruby. Also, given that Ruby was designed as a multiplatform runtime with optional platform-specific tools, one-liners fully written in Ruby are often more portable across systems, whether slightly different Unix variants or Microsoft Windows.
Extending the reach of Ruby using libraries is easily done even for one-liners, using the -r command-line flag. For instance, here is a one-liner that computes the MD5 digest of the input, emulating the md5sum program I often find lurking on Linux systems.
ruby -rdigest -ne 'BEGIN{$d=Digest::MD5.new}; $d << $_; END{puts $d.hexdigest}'
This program yields the same results on any Linux and BSD distribution, Mac OS X and Windows.
It is clear that both Awk and Ruby have their place in a scripter's text processing toolbox. The translation exercise shows that both languages can handle the same text processing tasks, albeit with occasional awkwardness.
On the one hand, Awk tends to be better at filtering tasks, that is, conditional printing. Awk's forms, in this case, are more expressive and compact than Ruby's. Awk also excels in variable handling (automatic initialization of all-global symbols), lightweight numerical processing and I/O redirection. Chained in pipes involving other common Unix tools, Awk acts as an excellent filter.
On the other hand, Ruby seems to be better at reduction tasks, whereby the text is crunched to a single result or a different stream. Line-oriented Ruby one-liners are seldom shorter than Awk equivalents, except when concerned with explicit linebreak management. Ruby rather shines when its processing targets the whole text at once, stored into an array, using the ruby -e 'readlines...' invocation. From the array returned by readlines, multiple array-processing methods can be chained together, facilitating the detection of patterns that involve groups of lines or the transformation of the array.
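Two small examples of this whole-text style, written as methods over an in-memory stream so they can be tried without input files. Both are sketches of my own choosing: the first mirrors ruby -e 'puts readlines.reverse', the second collapses runs of identical adjacent lines the way uniq does.

```ruby
require "stringio"

# Print the lines from last to first.
def reverse_lines(input)
  input.readlines.reverse.join
end

# Keep one line out of each run of identical adjacent lines.
def squeeze_repeats(input)
  input.readlines.chunk_while { |a, b| a == b }.map(&:first).join
end

print reverse_lines(StringIO.new("1\n2\n3\n"))
print squeeze_repeats(StringIO.new("a\na\nb\na\n"))
```

Neither task fits the one-line-at-a-time loop comfortably, since each answer depends on neighboring lines or on the order of the whole text.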
In addition, Ruby has the strength of being a batteries-included general-purpose language. Its multiple libraries and add-ons can be brought to bear to design programs without relying on other tools and their system- or version-specific idiosyncrasies. This often results in more portable one-liners. Awk hearts Unix; Ruby hearts Ruby.