Let's Review Some Bash: Printing Dangling Multipart Uploads to S3 for Processing
████████: @timvisher: can you give this a bash code review for me?
bengarvey: @████████ last time I asked Tim Visher to review my bash script I rewrote it in a different language
In this (potential) series I take bash code that comes across my inbox and review it for all the world to see. It's an attempt to sharpen my own scripting skills as well as potentially be a source for others to get better at scripting themselves. I still consider myself an amateur scripter and my opinions are my own. If you really want to learn from the best, talk to Greg.
In this edition of Let's Review Some Bash we're going to look at a script that is supposed to clean up dangling multipart S3 uploads. We're going to transform this
    aws s3api list-multipart-uploads --bucket "$BUCKET" --output json |
        jq -r '.Uploads[] | "\(.Initiated) \(.Key) \(.UploadId)"' |
        awk '
            BEGIN {
                cmd="date -u +\"%Y-%m-%dT%H:%M:%S.%3NZ\" -d \"7 days ago\""
                cmd | getline last_week
                close(cmd)
            }
            $1

**--gnu:** Behave like GNU parallel. This option historically took precedence over --tollef. The --tollef option is now retired, and therefore may not be used. --gnu is kept for compatibility.
Got it. So that flag at least appears to be unnecessary.
I'm honestly unable to determine what the - is doing there. Every time I tried to test it locally it resulted in a bash invocation error as parallel attempted to execute something like bash - foo bar. I'll assume that what was really meant to be happening here is having parallel act like cat | sh as is stated in the manual if you don't specify a command to pass arguments to. Perhaps the script wasn't fully written at the time. Since we've whittled the parallel command all the way down to a fancy | sh and we're not actually passing it anything to execute, let's get rid of it altogether.
Let's see if there's anything to do with the aws command.
    #!/usr/bin/env bash

    ##########################################################################
    ### cleanup_multipart_s3_uploads finds multipart uploads that have been
    ### executing for more than a week and prints them for processing
    ### elsewhere
    ##########################################################################

    # Will execute in AWS_DEFAULT_REGION or whatever region your current
    # config points to
    aws s3api list-multipart-uploads --bucket "$BUCKET" --output json |
        jq -r '.Uploads[] | "\(.Initiated) \(.Key) \(.UploadId)"' |
        awk '
            BEGIN {
                cmd="date -u +\"%Y-%m-%dT%H:%M:%S.%3NZ\" -d \"7 days ago\""
                cmd | getline last_week
                close(cmd)
            }
            $1

    jq -rn '[1, "foo bar bat", null, "boop"] > | @sh')"
    arg: ‘1’
    arg: ‘foo bar bat’
    arg: ‘null’
    arg: ‘boop’
Note: This is much more dangerous than the manual makes it sound because of Word Splitting. Watch what happens if I don't use eval and string templating:
    $ printf 'arg: ‘%s’\n' $(jq -rn '[1, "foo bar bat", null, "boop"] | @sh')
    arg: ‘1’
    arg: ‘'foo’
    arg: ‘bar’
    arg: ‘bat'’
    arg: ‘null’
    arg: ‘'boop'’
I'll leave it as an exercise for the reader to see just how many things that could break.
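To make the difference countable without needing jq installed, here is a minimal sketch in which a hard-coded string stands in for jq's @sh output. The variable `quoted` and the helper `count_args` are made up for illustration and are not part of the script under review.

```shell
#!/usr/bin/env bash
# Stand-in for the output of: jq -rn '[1, "foo bar bat", null, "boop"] | @sh'
# (hard-coded here, as an assumption, so the sketch runs without jq).
quoted="1 'foo bar bat' null 'boop'"

# Hypothetical helper: prints how many arguments it received.
count_args() { printf '%s\n' "$#"; }

# Unquoted expansion: bash word-splits on whitespace and treats the
# single quotes as ordinary characters, so "foo bar bat" becomes 3 words.
split_count=$(count_args $quoted)

# eval re-parses the text as shell code, so the quotes group words again.
# Only do this with input that is already shell-quoted and that you trust.
eval_count=$(eval "count_args $quoted")

printf 'word splitting: %s args, eval: %s args\n' "$split_count" "$eval_count"
```

The unquoted expansion should report 6 arguments and the eval version 4, which lines up with the two printf transcripts above.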
Now let's assume that we're not super interested in safety and we're pretty sure we're only going to run this script attended. In that case:
    unset round
    while IFS=$'\t' read -r a b c; do
        (( round++ ))
        printf 'arg round '"$((round))"': ‘%s’\n' "$a" "$b" "$c"
    done
    # arg round 1: ‘e st as’
    # arg round 1: ‘f bou’
    # arg round 1: ‘g sthaouth a u u’
    # arg round 2: ‘h’
    # arg round 2: ‘i’
    # arg round 2: ‘j’
There's kind of a lot going on here, so let's dig in before getting back to how it'll be used in the script. Overall we're using the @tsv jq formatter combined with a good ol' while read … loop and IFS. This is the venerable FAQ 001. As long as our input data doesn't contain any tab characters itself, we should at least avoid bad word splitting.
We're unsetting round on the off chance that the variable exists in your session already and then executing the while read loop. For each read we're temporarily setting IFS to the mysterious looking value $'\t'. If you're not familiar with the many types of quoting in bash, this one is ANSI-C quoting, which exists to let you insert common C escape sequences into a word. Here we're setting IFS to a literal tab character. A literal tab typed into the script would work just as well from bash's perspective, but it's hard to tell a tab from spaces by eye, so we use ANSI-C quoting to make the intent obvious.
The read operation uses -r to disable backslash sequence interpretation, which is almost always what you want, and then reads three values from the line. How the line is split is determined by IFS, which is why we set it to tab; the three variables receive the first field, the second field, and the rest of the line. To quickly demonstrate what I mean:
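As a small self-contained sketch of that splitting behavior, the following feeds read a made-up line with four tab-separated fields. The variable names and the sample data are assumptions chosen only for illustration.

```shell
#!/usr/bin/env bash
# A made-up line with four tab-separated fields; some contain spaces.
line=$'one field\ttwo\tthree\tfour and more'

# IFS is set only for the duration of this read command.
IFS=$'\t' read -r first second rest <<<"$line"

printf 'first:  ‘%s’\n' "$first"
printf 'second: ‘%s’\n' "$second"
printf 'rest:   ‘%s’\n' "$rest"
```

Spaces inside a field survive, and the last variable keeps everything left over, including the tab between the third and fourth fields. One caveat: tab counts as IFS whitespace, so consecutive tabs collapse into a single separator and empty fields are not preserved.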
$ IFS=$'$\t' read -r first second rest &2 fi
but I digress. As we've already stated we're not super interested in safety. Just mild, everyday interest. Now for each successful read (which, if you're wondering, will stop succeeding once STDIN closes) we're going to increment the variable round in an arithmetic context. These are neat because they assume that words are variables. In many cases it's actually less efficient to use parameter expansion if your variable is already a number, simply because you're going through an extra round in the interpreter. The other cool thing is that in a bash arithmetic context an unset variable, a variable set to null, and a variable set to 0 all evaluate to the same value. So we don't need to 'set up' bash at all to use the round variable. We just increment it and, because it's unset, bash treats it as 0 for us and then adds one. I love languages that don't get in my way.
Then we printf all 3 args we receive, embedding curly quotes and the round into the format string. printf reuses its format string once per remaining argument, but that's probably not what you're wondering about. What you're probably wondering about is why those weird '" or "' sequences are happening in the command.
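Here is a minimal sketch of the unset-variable behavior in an arithmetic context. The names `after_first`, `after_second`, and `total` are made up for the demonstration.

```shell
#!/usr/bin/env bash
unset round            # make sure we start from an unset variable

(( round++ )) || true  # unset counts as 0; post-increment evaluates to 0,
                       # so the command's exit status is 1 this first time
after_first=$round

(( round++ ))          # evaluates to 1, exit status 0
after_second=$round

# Inside (( )) bare words are read as variables, so no $ is needed.
(( total = round * 10 ))

printf 'round=%s total=%s\n' "$round" "$total"
```

The `|| true` is only there to highlight that `(( … ))` returns a failing status whenever the expression evaluates to 0. That is harmless in an attended script like this one but can matter if you run under `set -e`.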
This blew my mind when I first saw it, but it's one of the most powerful things to understand in bash. What really opens up bash scripting to you is the realization that strings don't exist in this language. This is maybe the most common misconception I see across the web. You'll see people doing things like --arg='foo' and you can just feel their Algol roots digging deeper as they desperately cling to the idea that somehow bash knows that this command has an argument called arg and that it should 'set' it to the string 'foo'. Nothing could be further from the truth. Bash doesn't know anything about the command you're trying to run. Everything in bash exists to help you assemble an array of words so it can set ARGV for a system call. Adopting that term, words, marked a serious point of growth for me in effective scripting. Once you realize that
    --arg='foo' == --'ar'g=foo == "--"arg''""''""='fo'o
your world really opens up. You begin to realize that the quotes are not delimiting 'strings' as they do in most languages but instead are activating different kinds of bash features.
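If you'd like to check that claim yourself, a minimal sketch follows. The helper `show_words` is hypothetical; it simply prints each word it receives on its own line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints each argument it receives on its own line.
show_words() { printf '%s\n' "$@"; }

a=$(show_words --arg='foo')
b=$(show_words --'ar'g=foo)
c=$(show_words "--"arg''""''""='fo'o)

# All three spellings reach the command as the same single word.
printf '%s\n' "$a" "$b" "$c"
```

Each line of output should read `--arg=foo`: the quotes are consumed by bash while it assembles the word and never reach the command.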
In this instance we use single quotes, which deactivate all bash expansion features, until we want to embed round in the string. Then we close the initial single quote context and open, without any spaces, a double quote context, which turns on many kinds of expansions, including the arithmetic expansion we use here ($(( … ))). Then we close out that context and continue with a single quote context. Note that the embedded \n in this context is not expanded by the shell but is instead passed through untouched to printf, which interprets the escape itself. If we had so desired we could have done something wild like
printf 'arg round '"$(( round ))"': ‘%s’'$'\n'
Note that my editor doesn't even know how to syntax highlight that but bash sure knows how to interpret it.
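A quick way to convince yourself that the two spellings are equivalent is to capture both results and compare them. This is only a sketch: the value of `round` and the variable names `via_printf` and `via_ansi_c` are arbitrary choices for the demonstration.

```shell
#!/usr/bin/env bash
round=3   # arbitrary value, just for the demonstration

# \n left inside single quotes: printf itself turns it into a newline.
printf -v via_printf 'arg round '"$(( round ))"': ‘%s’\n' 'x'

# $'\n' is expanded by bash, so printf receives a real newline character.
printf -v via_ansi_c 'arg round '"$(( round ))"': ‘%s’'$'\n' 'x'

if [[ $via_printf == "$via_ansi_c" ]]; then
    echo 'both spellings produce the same text'
fi
```

`printf -v` stores the result in a variable instead of printing it, which keeps the trailing newline that a command substitution would strip.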
The upshot of this is that we'll get each argument printed between those fancy quotes, labeled with the round in which it was read.
Finally, what are we reading from? Well, we're redirecting from a process substitution, which essentially opens a temporary file descriptor that will get cleaned up for you once the process exits. We then use a here document with jq and a JSON structure that apes essentially what we'll find in our aws output and voilà, we're done! It blew my mind that the location of here documents and redirections within a command doesn't matter; bash parses them off for you wherever they occur.
To wrap that into the original script:
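Putting the pieces together, here is a runnable sketch of the whole loop reading from a process substitution. To keep it independent of jq, a hypothetical function `make_tsv` stands in for the jq invocation and its here document and emits the same two tab-separated records by hand.

```shell
#!/usr/bin/env bash
# Stand-in for the jq ... @tsv output described above: two tab-separated
# records, hard-coded (an assumption) so the sketch runs without jq.
make_tsv() {
    printf '%s\t%s\t%s\n' 'e st as' 'f bou' 'g sthaouth a u u'
    printf '%s\t%s\t%s\n' 'h' 'i' 'j'
}

unset round
while IFS=$'\t' read -r a b c; do
    (( round++ ))
    printf 'arg round '"$(( round ))"': ‘%s’\n' "$a" "$b" "$c"
done < <(make_tsv)

# Because the loop reads from a process substitution rather than a pipe,
# it runs in the current shell and round is still set here.
printf 'rounds read: %s\n' "$round"
```

Had the data been piped into the loop instead (`make_tsv | while …`), bash would by default run the loop in a subshell and `round` would be unset again afterwards, which is one practical reason to prefer the redirection.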
    #!/usr/bin/env bash

    ##########################################################################
    ### cleanup_multipart_s3_uploads finds multipart uploads that have been
    ### executing for more than a week and prints them for processing
    ### elsewhere
    ##########################################################################

    # Will execute in AWS_DEFAULT_REGION or whatever region your current
    # config points to
    aws s3api list-multipart-uploads --bucket "$BUCKET" --output json |
        jq -r '.Uploads[] | [.Initiated, .Key, .UploadId] | @tsv' |
        awk -F $'\t' '
            BEGIN {
                cmd="date -u +\"%Y-%m-%dT%H:%M:%S.%3NZ\" -d \"7 days ago\""
                cmd | getline last_week
                close(cmd)
            }
            $1

    # ‘e st as’ ‘f bou’ ‘g sthaouth a u u’
    # ‘h’ ‘i’ ‘j’
Speaking of awk, its usage here is interesting. I'm not an awk expert, but as near as I can tell this is using awk to shell out to GNU date (not portable) to get a formatted timestamp for 7 days ago, which it then uses to print the 3 values we got from jq if Initiated is less than that (that is, the upload started more than 7 days ago).
With our newfound quoting powers we could at least just make that
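The comparison can be done on the text alone because fixed-width ISO 8601 UTC timestamps sort lexicographically in chronological order. A minimal bash sketch of the same idea follows, using a hard-coded cutoff so the result is reproducible; the helper `is_older_than_cutoff` and the sample timestamps are assumptions for illustration, not part of the reviewed script.

```shell
#!/usr/bin/env bash
# Fixed cutoff so the result is reproducible; the real script would
# compute it, e.g. with GNU date as the awk BEGIN block above does.
cutoff='2020-04-04T12:41:48.000Z'

# Hypothetical helper: succeeds when the timestamp sorts before the cutoff.
is_older_than_cutoff() {
    [[ "$1" < "$cutoff" ]]
}

for initiated in '2020-04-03T12:41:48.000Z' \
                 '2020-04-04T12:41:48.000Z' \
                 '2020-04-05T12:41:48.000Z'; do
    if is_older_than_cutoff "$initiated"; then
        printf '%s is older than the cutoff\n' "$initiated"
    fi
done
```

Only the first timestamp should be reported, since the second is equal to the cutoff and the third is newer. This only holds while every timestamp uses the same format and time zone.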
    awk -F $'\t' ' $1

    # ["DEBUG:",{"Initiated":"2020-04-03T12:41:48.000Z","Key":"a key","UploadId":"an id"}]
    # 2020-04-03T12:41:48.000Z a key an id
    # ["DEBUG:",{"Initiated":"2020-04-04T12:41:48.000Z","Key":"another key","UploadId":"another id"}]
    # ["DEBUG:",{"Initiated":"2020-04-05T12:41:48.000Z","Key":"and one more key","UploadId":"and one more id"}]
OK, now we're getting out of control. We're using jq in the first place with its null input argument so that we can construct a set of test records at various Initiated dates (8, 7, and 6 days ago respectively) in the shape our script expects. Then we pipe that to a second jq invocation (not strictly necessary) and demonstrate filtering the records based on the original awk criteria and, boom, we no longer need awk. I added the debug filter in there just so it's very easy to see that all of the records come through but only the one from 8 days ago is printed.
    #!/usr/bin/env bash

    ##########################################################################
    ### cleanup_multipart_s3_uploads finds multipart uploads that have been
    ### executing for more than a week and prints them for processing
    ### elsewhere
    ##########################################################################

    # Will execute in AWS_DEFAULT_REGION or whatever region your current
    # config points to
    aws s3api list-multipart-uploads --bucket "$BUCKET" --output json |
        jq -r '.Uploads[] | select(.Initiated











