Discover Top Posts Tagged with #textprocessing

Optimized Rendering of Text with CoreText

Given a large body of text, how do you parse it into pages that can fit onto a 1024x768 iPad screen? Or even break it up into beautiful magazine-like columns?

At first, CoreText seemed like an intimidating framework given I was unfamiliar with its C syntax, so I tried the brute force way: iterate over each word until boundingRectWithSize said I couldn't fit any more words for my view's frame. In a collectionView cell, the performance was horrible, as you could imagine, so scrolling was jittery.

The next step was to try CoreText. Initially, I followed the Core Text Programming Guide's Columnar Layout example, which uses CTFramesetterRef to divide the text into pages. The implementation suggests subclassing UIView and overriding drawRect:.

The result was much better. The parsing of the text happened instantaneously. But when profiled for memory allocations via Instruments, the app showed it was holding onto 12 MB of memory for each rendered page. Given that an article could have a dozen pages rendered onto the scroll view, memory was quickly building up and soon enough, you'd experience an out-of-memory crash. I had to research more about CoreGraphics performance.

I came across a great article about Rendering Graphics in iOS (GPU vs. CPU), and an even greater follow-up by Apple's (former) UIKit developer, Andy Matuschak!

The most important lesson for me was that drawing a view with the CPU-based framework, CoreGraphics, is a tradeoff: you gain performance at the cost of increased memory usage.

The final implementation uses CTFramesetterRef to quickly divvy up the text into an array of CFRanges (converted into NSRanges), but now I render the appropriate substring range into a regular UILabel, only when I need it. The label now consumes 7 MB of memory, down from 12 MB, for a full screen of text on an iPad.

#objective-c #objectivec #coretext #textprocessing #ios #coregraphics #optimization

Text processing one-liners: Ruby vs. Awk

Since my review of a book about Awk, I've been thinking a lot about text processing in the Unix stream-oriented workflow. Before learning Awk, I used Sed for easy text substitution stuff, yet my go-to language for text processing was Ruby: I have written many small programs that simply iterate over each line of a stream and process it. However, hailing from Perl, which hailed from Awk, Ruby is fully featured for text processing directly from the command line. As I was checking out its array of command-line options, the methods of its Kernel module and its multitude of predefined global variables, I wondered if those Awk one-liner programs would be just as easy and concise in Ruby.

So I verified this by translating all of Eric Pement's Awk one-liners to Ruby one-liners. I came away with multiple tricks and observations as to the respective behaviors of Ruby and Awk, so let me show them to you.

By the way, I know that most people would rather use Perl than Ruby for command-line text processing with the power of a general-purpose programming language. Well, I don't know any Perl, please don't tell anyone. Moreover, this subject will be covered in an upcoming e-book by Peteris Krumins, the author of two e-books that explain one-liner programs in Awk and Sed, respectively. I wish to add a bit to this bright fellow's wisdom, so let us carry on about Ruby.

Invocation

While Awk has few command-line options and a single invocation profile, Ruby has multiple invocation types, each of which determines the structure of the text processing program the one-liner can express. For Awk, all programs share the same structure: each line (record, actually, but all of Eric Pement's one-liners concern themselves with a division of a text file along line breaks) is divided into fields and is then processed by each program clause in succession, whenever the line or its fields satisfy the clause's condition. By default, the clause's condition is always true, so it is satisfied by all lines. Also, by default, the clause's body is to print the current line.

In contrast, the basic structure of command-line Ruby programs depends on multiple invocation flags. The most basic invocation is

ruby -e '<PROGRAM>' file1 file2 ...

which simply runs PROGRAM. To make this form use the input files (or standard input, when there are none), one must resort to method readlines, or to a while gets; ...; end form. This invocation is useful when one wants to process the text as an array of lines (using readlines), for reducing the text to a single piece of information or when dealing with patterns that emerge from groups of lines. However, for line-oriented processing, the invocation

ruby -ne ...

is preferable, as it implies the while gets; ...; end loop form. This means the PROGRAM will run as if it were within this implicit loop. This invocation is perfect for text filtering (conditional printing), as it does not automatically print lines. However, for text transformation (e.g. substitution of some pattern), lines are always printed after processing, so these programs would always finish with the print statement. Such programs are better run with the invocation

ruby -pe ...

which is the same as using the -n flag, but adds an implicit print at the end of each loop iteration.
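To make the implicit loops concrete, here is a minimal sketch of what the -n and -p flags roughly expand to, written as ordinary methods. The method names and the sample text are made up for illustration; the real flags loop over gets and work on $_ rather than on a block parameter.

```ruby
require "stringio"

# Roughly what `ruby -ne 'print if /error/'` does: loop over the lines
# and print only when the program asks for it.
def filter_lines(input, pattern)
  out = StringIO.new
  input.each_line { |line| out.print line if line =~ pattern }
  out.string
end

# Roughly what `ruby -pe '$_.sub!(/ok/, "OK")'` does: the same loop,
# plus an unconditional print at the end of each iteration.
def transform_lines(input, pattern, replacement)
  out = StringIO.new
  input.each_line do |line|
    line = line.sub(pattern, replacement)
    out.print line
  end
  out.string
end

sample = "ok: start\nerror: disk\nok: end\n"
print filter_lines(StringIO.new(sample), /error/)
print transform_lines(StringIO.new(sample), /ok/, "OK")
```

The filter only ever emits lines it explicitly prints, while the transformation emits every line, changed or not, which is why the second shape fits the -p flag.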

Division of lines in fields is one of the great features of Awk and it is also supported in Ruby: add the -a flag to the line-processing invocation.

ruby -ane ...

This implicitly adds the splitting of each line at the start of the iteration in the implicit loop on gets: the resulting array of fields is stored in the global variable $F. Note that Ruby's field-related features are not as rich as Awk's. Rewriting a field in Awk (e.g. $2 = "something") automatically rewrites the full line $0 so it contains the new field value (and normalizes the spacing). In Ruby, replacing an element of the $F array does not rewrite $_. Therefore, field-oriented processing is typically done with the -n flag instead of the -p flag: when using a modified array of fields, the print statement must explicitly use $F instead of $_, so it cannot be implicit.
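The field-rewriting difference can be sketched as follows. The helper replace_field is hypothetical; it mimics what a one-liner such as ruby -ane '$F[1] = "something"; puts $F.join(" ")' would do for a single line, assuming whitespace-separated fields.

```ruby
# Hypothetical helper: mimic Awk's `$2 = "something"; print` on one line.
def replace_field(line, index, value)
  fields = line.split      # what the -a flag would store in $F
  fields[index] = value    # changes the array only; the line itself is untouched
  fields.join(" ")         # rebuild the record explicitly, single-spaced
end

line = "alice   42  admin\n"
puts replace_field(line, 1, "something")
print line                 # the original line is unchanged, spacing included
```

As in Awk, the rebuilt record comes out with normalized spacing, but only because the join is spelled out.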

Line endings

When comparing Awk and Ruby programs, one immediately notes that the two languages handle linebreaks (or, more generally, input record separators) very differently. Namely, while Awk removes these separators from the record $0, Ruby leaves them on. In the latter, the platform's default linebreaks can be removed using the chomp method. Confusingly, this method has different semantics depending on whether it is called on a target of class String or without an explicit target: the former simply returns a linebreak-free copy of the target; the latter implicitly acts as if it targeted $_, but it also reassigns the copy to $_ and thus has a side effect the former does not have.
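A small sketch of the two behaviors, using only String methods. Bare chomp is available only under -n or -p, so String#chomp! stands in for its side effect here; the helper names are illustrative.

```ruby
# String#chomp returns a copy and leaves its receiver alone.
def chomp_copy(line)
  copy = line.chomp
  [copy, line]
end

# String#chomp! strips the linebreak in place, similar in effect to what
# bare `chomp` does to $_ under -n or -p.
def chomp_in_place(line)
  line.chomp!
  line
end

copy, original = chomp_copy("some text\n")
p copy, original
p chomp_in_place("some text\n".dup)
p "\n".chomp.empty?   # a nondestructive emptiness test on a blank line
```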

Keeping the linebreaks during processing can be either an advantage or an inconvenience, depending on the situation. For nondestructively testing whether the line is empty, given the semantics of Kernel#chomp (i.e. chomp without an explicit target), one must resort to the long-winded $_.chomp.empty? form, or to a less expressive and more computationally demanding regular expression matching ~/^$/ form (easily extended to ~/^\s*$/ to test for visibly empty lines, which contain only whitespace). Using the -a invocation flag, one might also use the form $F.empty?, but this still involves some hoop jumping and unnecessary computations (in this case, the splitting of the line while the fields are not needed). However, keeping the linebreaks on makes their processing local, whereas in Awk linebreaks are set explicitly through global variable ORS. It also eases the printing of the line with the print statement, given that in Ruby, the default output linebreak used by print (global variable $\, equivalent to Awk variable ORS) has the nil value, which resolves to the empty string. Which leads us to a discussion of...

Print statements

Both Ruby and Awk have the printf statement, which have equivalent semantics and are similarly context-free. In addition, they both have a print statement with equivalent semantics, but that work in different contexts. In Awk, the print statement writes all its arguments (after proper conversion to strings using internal routines) separated with the value of variable OFS and ends the output with the value of variable ORS. Similarly, Ruby's print writes all its arguments (after proper conversion to strings using each argument's to_s method) separated with the value of global variable $, and ends the output with the value of global variable $\.

The difference lies in the default values of both pairs of variables: in Awk, OFS defaults to a single space and ORS to a newline, the same default values as FS and RS. In contrast, $, and $\ both default to nil and nil.to_s returns the empty string. Field separation must therefore be set explicitly. As for line separation, when the last argument of print is the current line (which is the case when passing no argument, which implicitly prints the value of $_), this is of no matter, as the output naturally terminates with the line's original linebreak (unless method chomp was invoked before print, natch). However, when only printing a subset of the fields, or fields interspersed with new data, the value of $\ must also be set, which grows the one-liner.

However, Ruby offers the puts method as an alternative to print. This method writes each of its arguments (converted to a string when necessary) and follows each one with a linebreak, unless the argument already ends with one. Hence, this is the method of choice to print a chomped line. In addition, Ruby has method Array#join to easily merge the contents of an array of strings (like $F) into a single string. Coupled with the puts method, this works around the annoyance of explicitly setting the output separator variables, especially when dealing with records composed of varying numbers of fields. When the output has a fixed number of fields, method printf often yields easier output control, as well as a shorter one-liner.
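The three output styles can be compared side by side. This is a sketch that assumes $, and $\ are left at their nil defaults; the render helper and the sample fields are made up, and a StringIO stands in for standard output so the result can be inspected.

```ruby
require "stringio"

# Print the first and third fields three ways.
def render(fields)
  out = StringIO.new
  out.print fields[0], fields[2]             # no separator, no linebreak by default
  out.print "\n"
  out.puts fields.values_at(0, 2).join(" ")  # join supplies the separator, puts the linebreak
  out.printf "%s:%s\n", fields[0], fields[2] # printf spells out both
  out.string
end

print render("alpha beta gamma\n".split)
```

With the defaults untouched, only the puts and printf lines come out with their fields separated.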

Variable scope

Many one-liners require the storage of some intermediate result into a variable for accumulation or processing at a later time. Awk, tailor-made for line-oriented processing, scopes all variables as global: wherever a variable is set, it can be accessed later, be it in another clause or in the BEGIN or END block. In addition, variables need not be explicitly initialized. If first used in the context of additive accumulation, a variable is implicitly initialized to 0; if instead used in the context of array element setting, it is implicitly initialized to an empty array.

Ruby is rather a general-purpose language shoehorned into line-oriented processing. Such a language requires more rules for managing complexity, and these rules include local variable scope. An unadorned identifier is considered a local variable. Thus, if it is initialized in the BEGIN block of the program, its value is lost in the main body. Explicit global variables, which are prefixed with a $ sign, must be used to bypass the limitations imposed by local scoping. Moreover, all variables must be explicitly set to a value (thus initialized) before first usage as the target or argument of a method call. Therefore, Ruby one-liners require a BEGIN block more often than equivalent Awk programs, for variable initialization.
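As an illustration, here is a sketch of the accumulation that Awk writes as awk '/x/ { n += 1 } END { print n }', which in one-liner form would be something like ruby -ne 'BEGIN{$n=0}; $n += 1 if /x/; END{puts $n}'. The method names and global variable names below are illustrative only.

```ruby
# Counting with an explicitly initialized global, the job a BEGIN block
# performs in the one-liner.
def count_matching(text, pattern)
  $count = 0
  text.each_line { |line| $count += 1 if line =~ pattern }
  $count
end

# Without initialization, the global is nil and nil has no + method.
def uninitialized_accumulation_fails?
  $never_set_anywhere += 1
  false
rescue NoMethodError
  true
end

puts count_matching("x1\ny\nx2\n", /x/)
puts uninitialized_accumulation_fails?
```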

Numerical processing

A compelling feature of Awk is its capacity for treating fields of a line as numbers, for line-oriented number crunching. This ability stems from the fact that common arithmetic operators indicate that fields are expected to be numbers, so Awk automatically turns the string fields into integers or floating-point approximations, on which many common numerical operations can be computed.

Ruby can also handle numbers, but not in the implicit way Awk does, because of its flexibility in overloading operators. For better and for worse, Ruby implements both arithmetic and string operations (concatenation and repetition, to name them) with the same operator symbols. Hence, where in Awk two fields may be added together with the expression $1+$2, the equivalent Ruby form $F[0]+$F[1] results in the concatenation of the first two fields, not their numerical addition. Therefore, explicit conversion methods such as to_i and to_f are necessary for number crunching using Ruby one-liners, resulting in larger programs than the Awk equivalents.
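A short sketch of the difference, with hypothetical helper names and whitespace-separated sample fields:

```ruby
# $F[0] + $F[1] on string fields: concatenation.
def concatenated(line)
  f = line.split
  f[0] + f[1]
end

# Explicit conversion gives the numerical sum Awk computes for $1+$2.
def added(line)
  f = line.split
  f[0].to_i + f[1].to_i
end

p concatenated("3 4\n")
p added("3 4\n")
p "2.5".to_f + "1.5".to_f
p "abc".to_i   # non-numeric text converts to 0 rather than raising
```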

I/O redirection

Redirection forms for the print statements are a very strong suit of Awk. Using a statement of the form print ... > "output.txt" results in the automatic opening of file output.txt, as well as its automatic closing at the end of the execution. Redirecting to subprocesses is just as elegant.

The closest Ruby form is the open method, which yields IO objects with print, puts, printf and gets methods that act similarly to their counterparts without an explicit target. In addition, open may both handle files as well as subprocess piping. Where this approach falters, however, is in the absence of automatic opening or closing. For an Awk program with structure

awk 'COND { print > "output.txt" }'

the equivalent Ruby is

ruby -ne 'BEGIN{$f=open("output.txt","w")}; $f.print if COND; END{$f.close}'

The manual file management in BEGIN and END blocks mires the one-liner. Shell redirection can be used instead in many instances, but the Awk redirection forms remain easier to deal with.

One last approach worth mentioning is for the case of output redirection using the block variant of the open method. The program hinted above could instead be written as

ruby -ne 'open("output.txt","a"){|f| f.print} if COND'

which is almost as terse as the Awk program. However, this implies that file output.txt is opened for appending and then closed every time the condition COND is satisfied. This is obviously more computationally demanding than the Awk program.
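Once the program outgrows a single line, the block form can wrap the whole loop instead of each print, so that the file is opened once and closed automatically. A minimal sketch, in which write_matches, the sample lines and the temporary directory are illustrative assumptions:

```ruby
require "tmpdir"

# Write the lines matching pattern to path, opening the file only once.
def write_matches(lines, path, pattern)
  File.open(path, "w") do |f|   # closed automatically when the block ends
    lines.each { |line| f.print line if line =~ pattern }
  end
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "output.txt")
  write_matches(["keep 1\n", "drop\n", "keep 2\n"], path, /keep/)
  print File.read(path)
end
```

This gives up the implicit loop of the -n flag, which is the price for a single open and close.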

Code libraries

Ruby starts shining with respect to Awk when text processing must involve additional libraries that extend the tools available. In Awk, anything that goes beyond file-like I/O and simple math must be piped in from or piped out to external programs. Welcome to Unix: Awk solves one simple problem expecting that if it is only part of the solution, the rest will be handled by other equally simple and specialized programs.

Ruby, however, runs in line with modern interpreted programming languages that label themselves as batteries included (this trope actually comes from Python, but it describes Perl and Ruby just as well). Want to get your input from a network connection? You can use Netcat (nc) and Awk, or just Ruby. Want to push the output to an e-mail and have it sent using Gmail/TLS with your Google credentials? You can use Awk and a painstakingly configured Sendmail, or you can use Ruby. Want to use as input a directory listing? You can use ls and Awk, or you can use Ruby. Also, given that Ruby was designed as a multiplatform runtime with optional platform-specific tools, one-liners fully written in Ruby are often more portable across systems, whether slightly different Unix variants or Microsoft Windows.

Extending the reach of Ruby using libraries is easily done even for one-liners, using the -r command-line flag. For instance, here is a one-liner that computes the MD5 digest of the input, emulating the md5sum program I often find lurking on Linux systems.

ruby -rdigest -ne 'BEGIN{$d=Digest::MD5.new}; $d << $_; END{puts $d.hexdigest}'

This program yields the same results on any Linux and BSD distribution, Mac OS X and Windows.
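The same incremental digest can be written as a small method, which makes it easy to check that feeding the input line by line gives the same result as hashing it in one go. The method name and the sample text are illustrative.

```ruby
require "digest"

# Feed the lines one at a time, as the one-liner does with $_.
def md5_of_lines(lines)
  digest = Digest::MD5.new
  lines.each { |line| digest << line }
  digest.hexdigest
end

text = "first line\nsecond line\n"
puts md5_of_lines(text.lines)
puts Digest::MD5.hexdigest(text)   # one-shot digest of the same text
```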

Summary

It is clear that both Awk and Ruby have their place in a scripter's text processing toolbox. It is demonstrated that both languages can handle the same text processing tasks, albeit with occasional awkwardness.

On the one hand, Awk tends to be better at filtering tasks, that is, conditional printing. Awk's forms, in this case, are more expressive and compact than Ruby's. Awk also excels in variable handling (automatic initialization of all-global symbols), lightweight numerical processing and I/O redirection. Chained in pipes involving other common Unix tools, Awk acts as an excellent filter.

On the other hand, Ruby seems to be better at reduction tasks, whereby the text is crunched to a single result or a different stream. Line-oriented Ruby one-liners are seldom shorter than Awk equivalents, except when concerned with explicit linebreak management. Ruby rather shines when its processing targets the whole text at once, stored into an array, using the ruby -e 'readlines...' invocation. From the array returned by readlines, multiple array-processing methods can be chained together, facilitating the detection of patterns that involve groups of lines or the transformation of the array.
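Two sketches of this whole-text style, operating on an array of lines such as the one readlines returns. The helper names are made up; as one-liners they would read something like ruby -e 'puts readlines.reverse'.

```ruby
# Print the text backwards, last line first.
def reverse_lines(lines)
  lines.reverse.join
end

# Detect a pattern spanning groups of lines: lines immediately repeated.
def repeated_lines(lines)
  lines.each_cons(2).select { |a, b| a == b }.map(&:first).uniq
end

lines = "a\nb\nb\nc\n".lines
print reverse_lines(lines)
p repeated_lines(lines)
```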

In addition, Ruby has the strength of being a batteries-included general-purpose language. Its multiple libraries and add-ons can be brought to bear to design programs without relying on other tools and their system- or version-specific idiosyncrasies. This often results in more portable one-liners. Awk hearts Unix; Ruby hearts Ruby.

#Ruby #Awk #TextProcessing #OneLiner

Review of "Awk One-Liners Explained," by Peteris Krumins

# A history lesson

As a target for writing code, the UNIX software ecosystem is strangely fascinating. Among its features, one standout is its reliance on text as a basic information storage unit. This sentence looks dumb out of context, because *of course*. But in the age of rich media and omnipresent design, the simplicity of carrying information on text alone is refreshing, comforting in its minimalism (doh! I've managed to bring up design; must resist).

UNIX was also built on the principle that software designed for it should be made to manipulate text streams in such a way that programs could be connected like stations on a train track. The text train would go out from the first station, stop at each station for some processing and arrive at the last station rich with the solution to some problem. It will brand me as stuck in the past, but I am utterly seduced by this paradigm of powerfully plain, point-free, simple-step-by-simple-step data processing.

This is why I jumped on the occasion to learn about the [awk](http://www.grymoire.com/Unix/Awk.html) utility, when I heard of [Peteris Krumins'](http://catonmat.net/about/) e-book [*Awk One-Liners Explained*](http://www.catonmat.net/blog/awk-book/). As curious as I am about text processing, I mostly do it with small Ruby programs and some simple uses of [Sed](http://www.grymoire.com/Unix/Sed.html). I know that Ruby, a language that I both love and hate (but mostly love), borrows from Perl, which borrows from Awk, so I took it as a history lesson, at least.

# A surprisingly useful tool

The book is essentially a cookbook, as the author teaches Awk by example. He starts with simple examples that show the basics and progressively ramps up the complexity of the examples, showing clever use of various Awk tools. The focus of the cookbook is on *one-liners*, Awk programs that can easily and readably fit on a single line of code.
This is highly relevant as I've never seen or heard about a very complex Awk program: the few invocations of Awk I've seen in shell scripts indeed carried all the editing code inline. What the author suggests is that more complex tasks are rather accomplished by stringing Awk one-liners in a stream processing sequence, an idiom of UNIX programming that can integrate other tools (e.g. `sort`, `cut`) to properly and elegantly solve the problem.

It is an excellent Awk tutorial. The author pushes no introductory theory or generality and dives right into the first example. As he carries on into the text, the general principles behind Awk programs are spelled out. This teaching approach works wonders and is well suited to showing Awk idiosyncrasies, compared to a classical exposition. The style and structure of the book encourage the reader to try the examples as he reads, which facilitates the assimilation of the material (in my opinion, as far as coding goes, practical knowledge is competence and theoretical knowledge is warm wind) and allows the reader to better feel when he grows tired and stops learning effectively.

The book argues strongly in favor of the usefulness of the Awk text processor, as well as the simplicity and the readability of Awk programs. While corresponding Sed one-liners would likely be more terse and compact, the C-like syntax has a flow and visual structure that makes Awk one-liners easy to understand, however concise they remain. Given that Awk is a standard POSIX utility, it is deployed in all UNIX systems that I care about (Ruby is not installed by default in even the most recent Ubuntu releases, which frustrated me a few times), and so it is a welcome addition to my common scripting tools.

# But the `man` page says there's more!

The main criticism I address to Krumins' otherwise excellent book concerns aspects that I would have liked covered in the book, or at least completely covered.
One of the staples of modern programming languages is iteration through lists or arrays using specific statements (e.g. `for elem in array...`). Awk has such an iterative form, which is relevant for one-liner programs; it also positions Awk among the inspirations for modern languages. Other features of Awk that could have made for interesting one-liners are the `next` and `nextfile` statements, as well as the `getline` statement. The latter is one that deserves treatment, given that it looks useful for combining two text streams into one and that it is as confusing as Canadian tax law.

In addition, while the book makes for an excellent Awk tutorial, it is difficult to use as a reference for the Awk language. Maybe it would be better for this purpose if the author added more page links in the index of the book. In his defense, though, Krumins publishes a free Awk cheat sheet, which does make for a good quick reference. And that's not counting the multiple other references that can be found on the web with just a bit of searching (many of which are of much poorer quality than the author's material, but one cannot have everything).

However, Peteris Krumins set out to explain a bunch of one-liners, and he did so superbly. I have had the history lesson I had settled for, and came away with an exciting new tool as a bonus. Moreover, the book is quite short, so it may be read to satisfaction in one little afternoon. I recommend it.

---

# Update #1

Peteris Krumins responded to my criticisms through Twitter, and I must retract one of my points. In my defense, it has been two or three weeks since I read the book, so hey. But the author does cover the `next` and `getline` statements. As for the latter, though, the alternative forms of the statement, by which files other than the current input file, or even pipes, are read, were not covered. These alternate forms are confusing: some change the `$N` fields, others don't; some forms advance `NR`, others don't.
The [documentation of GNU Awk](http://www.gnu.org/s/gawk/manual/html_node/Getline.html), for instance, goes to great lengths to delineate the cases, and putting up a cheat sheet would come in handy. And it is these forms that could allow the combination of two data streams in Awk. Here's a one-liner that alternates the lines from files named `file1` and `file2`:

    awk '{ print; if(getline < "file2") print; } \
         END { while(getline < "file2") print; }' file1

The first block prints all lines from `file1`; then, calling `getline < "file2"` fills `$0` with the next line from `file2`, which then gets printed. As a function, `getline` returns 1 on successful reading of a record, 0 when reaching the end of the file and -1 on error (meaning my one-liner does not handle errors correctly, but it's not the point of one-liners, is it?). Thus, on the one hand, if `file2` has fewer lines than `file1`, only the lines of the latter keep being printed. On the other hand, if `file2` is longer, the printing of its remainder is handled by the END block.

Long digression, but in the end, I mean to insist on one thing: this Awk book is authoritative, concise and highly readable. I enjoyed it to the end.

# Update #2

In yet another conversation with Mr. Krumins, he reminded me that the premise of his book was to paraphrase the [set of one-liners](http://www.pement.org/awk/awk1line.txt) originally published by Eric Pement. Thus, it is clear that from the beginning, he had no intention of covering the full extent of Awk features, which puts my no. 1 criticism somewhat beside the point. Indeed. But if this misguided comment put one more nice one-liner about `getline` out there (see update #1), I guess we'll all see some good came out of it.

#Awk #BookReview #PeterisKrumins #TextProcessing