Sunday, July 12, 2015

Plumbum Scripting

Scripting in Bash is a pain. Bash can do almost anything, and is unbeatable for small scripts, but it struggles when scaling up to anything close to a real-world scripting problem. Python is a natural choice, especially for the scientist who is already using it for analysis, but basic shell tasks are much clumsier in Python. So you are left with scripts that start out in Bash, become a mess, get (usually poorly) ported to Python, or even worse, get run by a Python script. I've seen countless Python scripts that run Bash scripts that run real programs. I've even written one or two. It's not pretty.

I recently came (back) across Plumbum, a really powerful library for writing efficient command line scripts in Python. It contains a set of tools that makes the four (five with color) main tasks of command line scripting simple and powerful. I will also go over the library's one main drawback (and a possible enhancement!).

Note: The colors module is new to Plumbum in 1.6.0.

Local commands

The first and foremost part of the library is a replacement for Python's subprocess module (and the older os.popen). I'll compare the "correct, current" Python standard library method with Plumbum's method.

Basic commands

Our first task will simply be to get our feet wet with a simple command. Let's run echo to print a string. This is easy with subprocess.call:

In [4]:
import subprocess
In [2]:
subprocess.call(["echo", "I am a string"])
Out[2]:
0

What just happened? The result, zero, was the return code of the call. The output of the call went to stdout, so if we were in a terminal, we would have seen it printed (in the IPython notebook, it shows up in the terminal that started the notebook). This might be what we want, but more likely we wanted the value of the output. That would be subprocess.check_output:

In [3]:
subprocess.check_output(["echo", "I am a string"])
Out[3]:
b'I am a string\n'

As you can already see, this not only requires a different call for each situation, but it also returns a bytes string (which is technically correct, but almost never what you want in a shell script). The reason for the different calls is that they are shortcuts to the underlying subprocess.Popen object. So we really need:

In [10]:
p = subprocess.Popen(["echo","I am a string"],
       shell=False, bufsize=512,
       stdin=subprocess.PIPE,
       stdout=subprocess.PIPE,
       stderr=subprocess.PIPE)

outs, errs = p.communicate()
outs
Out[10]:
b'I am a string\n'

As you can guess, this is only a smattering of the options you can pass (not all were needed for this call), but it gives you an idea of what working with subprocess entails.

Let's look at Plumbum. First, let's see the fastest method to get a command:

In [7]:
from plumbum import local, FG, BG, TF, RETCODE
echo = local['echo']
echo("I am a string")
Out[7]:
'I am a string\n'

Here, we have a local object, which represents the computer. It acts like a dictionary: if you index it with a command name, you get an object for the command that would run if you typed that name in a terminal. Let's look at the object we get:

In [5]:
echo
Out[5]:
LocalCommand(<LocalPath /bin/echo>)

Now this is a working Python object and can be called like any Python function! In fact, it can access most of the details and power of the Popen object we saw earlier. If you don't like to repeat yourself, there is a magic shortcut for getting commands:

In [6]:
from plumbum.cmd import echo

There is no echo command sitting in a cmd.py file somewhere; this dynamically does exactly what we did before, calling ['echo'] on the local object. This is quicker and simpler, but it is good to know what happens behind the scenes!

Plumbum also allows you to add arguments to a command without running it; as you will soon see, this lets you build complex commands just like Bash. If you use square brackets instead of parentheses, the command doesn't run yet (Haskell users: this is like currying; Pythonistas will know it as functools.partial):

In [7]:
echo["I am a string"]
Out[7]:
BoundCommand(LocalCommand(<LocalPath /bin/echo>), ['I am a string'])

When you are ready, you can call it:

In [8]:
echo["I am a string"]()
Out[8]:
'I am a string\n'

Or, you can run it in the foreground, so that the output is sent to the current terminal as it runs (this is the subprocess.call equivalent from the beginning, although non-zero return values are not handled in the same way):

In [8]:
from plumbum import FG
echo["I am a string"] & FG

Complex commands (piping)

Stdin

Now, how about feeding a Python string to a command's stdin? As an example, let's use the Unix dc command, a desk calculator with reverse Polish notation syntax.

In [10]:
from plumbum.cmd import dc

We can call it using the -e flag followed by the calculation we want to perform; 1 2 + p computes 1 + 2 and prints the result. We already know how to do that:

In [11]:
dc('-e', '1 2 + p')
Out[11]:
'3\n'

But it can also be run without this flag. If we do that, we can then type (or pipe) text into it from the shell.

With subprocess, we don't have a shortcut, so we have to use Popen, manually setting stdin and stdout to a subprocess PIPE, and then communicate in bytes.

In [9]:
proc = subprocess.Popen(['dc'],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE)
outs, errs = proc.communicate('1 2 + p'.encode('ascii'))
outs
Out[9]:
b'3\n'

Compare that to Plumbum:

In [13]:
(dc << '1 2 + p')()
Out[13]:
'3\n'
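
Plumbum also overloads < to redirect stdin from a file, mirroring the shell. A quick sketch, assuming a hypothetical file calc.txt containing the same RPN program:

(dc < 'calc.txt')()   # equivalent to `dc < calc.txt` in bash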

Piping

Of course, in Bash we can pipe text from one command to another. Let's compare that (I'm not even going to try the subprocess version here).

Since I'm using IPython, prefixing a line with ! causes it to run in the shell. So, in Bash:

In [21]:
!echo "1 2 + p" | dc
3

In Plumbum:

In [17]:
(echo["1 2 + p"] | dc)()
Out[17]:
'3\n'

If we want to see what that command would look like in Bash, we can print the unevaluated object:

In [18]:
print(echo["1 2 + p"] | dc)
/bin/echo '1 2 + p' | /usr/bin/dc
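
File redirection composes the same way: > sends stdout to a file (and >> appends), just like the shell. A small sketch, with result.txt as a placeholder name:

(echo['1 2 + p'] | dc > 'result.txt')()   # like `echo "1 2 + p" | dc > result.txt`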

Background execution

One of the great things about Bash is the ease of "simple" multithreading; you can start a command in the background using the & character. To test this, we need a long-running command that returns a value. In Bash, we can make one with the following function:

$ fun () { sleep 3; echo 'finished'; }
$ fun
finished
$ fun &
[1] 6210
$finished
[1]+  Done                    fun

Here, when we ran it in the foreground, it held up our terminal until it finished. The second time, it gave us back our terminal, but we were interrupted three seconds later by text from the process. If we wanted to interact with the process, or wait for it to finish, we could use $! to get the pid of the last spawned process, and then use wait on that pid (see git-all.bash for an example).

This simplicity is not usually easy to emulate in a programming language. Let's see it in Plumbum. Here, I'm piping sleep (which doesn't print anything) into echo, just to get a slow-running command, and I'm using IPython's time magic to measure the time taken:

In [19]:
%%time
sleep = local['sleep']
sleep_and_print = sleep['3'] | echo['hi']
print(sleep_and_print())
hi

CPU times: user 5.75 ms, sys: 9.25 ms, total: 15 ms
Wall time: 3.01 s
In [26]:
%%time
bg = sleep_and_print & BG
CPU times: user 3.94 ms, sys: 7.45 ms, total: 11.4 ms
Wall time: 20.4 ms

Now, bg is a Future object attached to the background process. We can call .poll() on it to see if it's done, or .wait() to block until it returns. Then we can access the stdout and stderr of the command. (Accessing stdout, etc. will automatically wait() for you, so you can use them directly.)

In [27]:
%%time
print(bg.stdout)
hi

CPU times: user 2.14 ms, sys: 1.76 ms, total: 3.9 ms
Wall time: 2.73 s

Remote commands

Besides local commands, Plumbum provides a remote machine class for working with remote machines over SSH in a platform-independent manner. It works much like the local object, using the best available backend (including Paramiko) to run the processes. I haven't moved my scripts from pure Paramiko to Plumbum yet, but only having to learn one interface for both local and remote machines is a huge plus (and Paramiko is fairly ugly to program in, like subprocess).
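
For a taste, here is a minimal sketch, assuming an SSH host you can already log into (the hostname and user are placeholders):

from plumbum import SshMachine

with SshMachine('example.com', user='me') as remote:
    r_ls = remote['ls']   # works just like local['ls']
    print(r_ls('-l'))     # runs on the remote machine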

Command Line Applications

Command line applications in Python already have one of the best toolkits available, argparse (C++'s Boost Program Options library is a close second). However, after seeing the highly Pythonic Plumbum cli module, argparse feels repetitive and antiquated.

Let's look at a command line application that takes a couple of options. In argparse, we would need to do the following:

In [25]:
%%writefile color_argparse.py
import argparse

def main():
    parser = argparse.ArgumentParser(description='Echo a command in color.')
    parser.add_argument('-c','--color', type=str,
                       help='Color to print')
    parser.add_argument('echo',
                       help='The item to print in color')

    args = parser.parse_args()
    print('I should print', args.echo, 'in', args.color, "- but I'm lazy.")
    
if __name__ == '__main__':
    main()
Overwriting color_argparse.py
In [26]:
%run color_argparse.py -c red item
I should print item in red - but I'm lazy.

As you can tell from the documentation, such programs quickly grow as you add more advanced commands, grouping, or subcommands. Now compare to Plumbum:

In [4]:
%%writefile color_plumbum.py
from plumbum import cli

class ColorApp(cli.Application):
    color = cli.SwitchAttr(['-c','--color'], help='Color to print')
    
    def main(self, echo):
        print('I should print', echo, 'in', self.color, "- but I'm lazy.")

if __name__ == '__main__':
    ColorApp.run()
Overwriting color_plumbum.py
In [5]:
%run color_plumbum.py -c red item
I should print item in red - but I'm lazy.

Here, we see a more natural mapping of class -> program, and we get a lot more control over the pieces as well. For example, if we want to add a validator, say to check for existing files, or to ensure a number is in a range or a word is in a set, we can do that on each switch. Switches can also be full-fledged functions that run when the switch is set (a sketch of these extras follows below). And we can easily extend this process to subcommands (see git-all.py) while staying readable and avoiding duplication.
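
Here is a hedged sketch of those extras; the --log and --verbose flags are made up for illustration, but the validators and the switch decorator are real parts of the cli module:

from plumbum import cli

class ColorApp(cli.Application):
    # Restrict the value to a fixed set of words
    color = cli.SwitchAttr(['-c', '--color'], cli.Set('red', 'green', 'blue'),
                           help='Color to print')
    # Require that the argument names an existing file
    logfile = cli.SwitchAttr('--log', cli.ExistingFile,
                             help='File to append output to')

    verbose = False

    @cli.switch('--verbose')
    def set_verbose(self):
        """A full-fledged switch function, run when the flag is given."""
        self.verbose = True

    def main(self, echo):
        print(echo)

if __name__ == '__main__':
    ColorApp.run()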

Path manipulations

Path manipulations using os.path functions are messy and become involved quickly. Things that should be simple require several chained functions to get anywhere. The situation was bad enough to warrant adding a new module to Python 3.4+, the provisional pathlib module. It is not a bad module, but you have to install a separate library to get it on Python 2.7 or 3.3, and it has a couple of missing features. Plumbum provides a similar construct that is automatically available if you are already using Plumbum, and it corrects two of the three missing features. The features I'm referring to are:

  • No support for manipulation of multiple extensions, like .tar.gz
    • Plumbum supports an additional argument to .with_suffix(), default matches pathlib
  • No support for home directories
    • Plumbum provides the local.env.home path
  • No support for using open(path) without wrapping in a str() call
    • Can't be fixed unless path subclasses str (not likely for either library, see unipath), or pathlib support added to the system open function (any Python devs reading? Please?)

I would love to see the pathlib module adopt the .with_suffix() addition that Plumbum has, and add some sort of home directory expansion or path as well.

Plumbum also has the unique little trick that // automatically calls glob, making path composition even simpler. I doubt we'll get this added to pathlib, but I can always hope (at least, until someone removes the provisional status).
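
Putting those pieces together in a quick sketch (the file names are hypothetical, and the depth keyword is how I read Plumbum's multiple-extension support mentioned above):

from plumbum import local

archive = local.path('data.tar.gz')
archive.with_suffix('.zip')           # pathlib-compatible default: data.tar.zip
archive.with_suffix('.zip', depth=2)  # replace both extensions: data.zip

home = local.env.home                 # the home directory, as a path
gitdirs = local.cwd // '*/.git'       # // globs, returning a list of paths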

Color support (NEW)

I've been working on a new color library for Plumbum. git-all.py has been converted to use it.

Colors are used through the Styles generated by the colors object. You can get colors and attributes like this:

from plumbum import colors
red = colors.fg.red         # Red foreground color
other_color = colors.bg(2)  # The second background color
bold = colors.bold
reset = colors.reset

You can directly access colors on the colors object as if it were the fg object. Standard terminal colors can be accessed with (), and the 256 extended colors can be accessed with [] by number, name (camel case or underscore), or HTML code. All Styles support with statements, which restore the normal font on exit (a single Style will reset only the necessary component if possible, like bold or the foreground color). You can manually take the inverse (~) to get the undoing action. Calling a Style with no arguments sends it to stdout. Using | wraps a string in a style (as does [] notation). Styles otherwise act just like normal text, so they can be added to strings, etc. (they are str subclasses, after all).
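
A short sketch of those points, using the plain colors object (the strings are arbitrary):

from plumbum import colors

print(colors.red | 'wrapped in red')  # | wraps a string in a style
print(colors.red['also red'])         # [] notation does the same

with colors.bold:                     # the normal font is restored on exit
    print('bold text')

unred = ~colors.red                   # the inverse, un-doing style
colors.reset()                        # called with no arguments: sent straight to stdout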

For the following demo, I'll be using htmlcolors and a with statement to capture output in IPython and display it as HTML. (See my upcoming post for a more elegant IPython display technique.) Also note that redirect_stdout is new in Python 3.4, but it is easy to implement in other versions if needed.

In [1]:
from plumbum.colorlib import htmlcolors as colors
In [2]:
from IPython.display import display_html
from contextlib import contextmanager, redirect_stdout
from io import StringIO # Python3 name

@contextmanager
def show_html():
    out=StringIO()
    with redirect_stdout(out):
        yield
    display_html(out.getvalue(), raw=True)

Now, inside the show_html context manager, we can use colors just like on a terminal (save for needing <br/> to break lines when we don't take advantage of the built-in htmlstyle print method, and having to be careful not to leave Styles un-reset).

In [6]:
with show_html():
    colors.green.print("This is in green!")
    (colors.bold & colors.blue).print("This is in bold blue!")
    colors.bg['LightYellow'].print("This is on the background!")
    colors['LightBlue'].print("This is also from the extended color set")
    print("This is {colors.em}emphasized{colors.em.reset}! (reset was needed)".format(colors=colors), end='<br/>')
    print("This is normal")
This is in green!
This is in bold blue!
This is on the background!
This is also from the extended color set
This is emphasized! (reset was needed)
This is normal

Putting it together in an example: git-all

Now, let's look at the real-world example mentioned earlier: git-all.bash. This is a script I wrote some time ago for checking a large number of repositories in a common folder. Due to the clever way git subcommands work, simply naming this git-all and putting it in your path gives you a git all command. It is written in very reasonable Bash, IMO, and works well.

Directory manipulation

Let's look at this piece by piece and see what would be required to convert it to Python. First, this script lives in one of the repos, so we need the directory containing the script, then up one level.

In Bash that's:

unset CDPATH
SOURCE="${BASH_SOURCE[0]}"
while [ -h "$SOURCE" ]; do 
  DIR="$( cd -P "$( dirname "$SOURCE" )" && pwd )"
  SOURCE="$(readlink "$SOURCE")"
  [[ $SOURCE != /* ]] && SOURCE="$DIR/$SOURCE"
done
DIR="$( cd -P "$( dirname "$SOURCE" )" && pwd )"
REPOLOC=$DIR/..

(Sorry for the awful highlighting by IPython; it hates the $ in Bash strings.)

Converted to Python:

REPOLOC = local.path(__file__) / '..'

We can find the directories that are valid repos:

for file in $(ls); do
  if [[ -d $REPOLOC/$file/.git ]]; then
    ...
  fi
done

with the real work going in place of the ellipsis.

In Python, a list comprehension with a glob does it in one line:

valid_repos = [d / '../..' for d in local.cwd // '*/.git/config']

The multiple ugly loops over all repos translate easily into a generator:

def git_on_all(bold=False):
    for n,repo in enumerate(valid_repos):
        with local.cwd(repo):
            with color_change(n):
                yield repo.basename

To use it, simply loop over git_on_all():

for repo_name in git_on_all():
    print('The current working directory is in the', repo_name, 'repo!')

Command line arguments

We don't have a nice cli tool in Bash, so we have to build long if statements. In Python, we can separate each command and let the help text be built for us:

@GitAll.subcommand("pull")
class Pull(cli.Application):
    'Pulls all repos in the folder, not threaded.'
    def main(self):
        for repo in git_on_all():
            git['pull'] & FG

This is git all pull, clean and separated from the ugly loops in Bash.

Multithreading

The fetch loop, one of the strong points of the Bash script, looks like this:

if [[ $1 == qfetch ]] ||
   [[ $1 == fetch ]] ||
   [[ $1 == status ]]; then

    for file in $(ls); do
      if [[ -d $REPOLOC/$file/.git ]]; then
        cd $REPOLOC/$file
        git fetch -q &
        spawned+=($!)
      fi
    done

    echo -n "Waiting for all repos to report: "
    for pid in ${spawned[@]}; do
      wait $pid
    done
    echo "done"
fi

This does what is normally an advanced multithreading task in a few simple lines. In Python, we have:

def fetch():
    bg = [git['fetch', '-q'] & BG
          for repo in git_on_all()]
    print('Waiting for the repos to report: ', end='', flush=True)
    for fut in bg:
        fut.wait()
    print('done')

This is just as readable, if not more so, and doesn't need the if block to check the input, since that's now handled by the cli interface. The actual version in the script can also report errors in the fetch, which the Bash version cannot.

Colors (classic tput method)

We would like to cycle colors, so each repo prints in a different color. My final Bash solution was elegant:

Bash (these are then printed with echo -n):

txtreset=$(tput sgr0)
txtbold=$(tput bold)

Python can do the same thing (these instead get run in the foreground, with & FG):

from plumbum.cmd import tput

txtreset = tput['sgr0']
txtbold = tput['bold']

Though with the Plumbum colors library, we don't have to.

Color changing is easy to implement with a Python context manager:

@contextmanager
def color_change(color):
    txtreset & FG
    txtbold & FG
    tput['setaf', color % 6 + 1] & FG
    try:
        yield
    finally:
        txtreset & FG

The try/finally block allows this to restore our color, even if it throws an exception! This is tremendously better than the Bash version, which leaves the color on the terminal if you make a mistake. A nice example of context managers can be found on Jeff Preshing's blog.

You can use it to wrap parts of the code that print in a color:

with color_change(1):
    print('This will be in color number 2')

Colors (new method)

Plumbum has a new colors tool; this is how you would use it in this script.

from plumbum import colors

Colors can be generated cyclically by number, and combinations of color and attributes can be put in a with statement, too:

with (colors.fg[1:7][n % 6] & colors.bold):

And, we can simply unbold:

colors.bold.reset.print(git('diff-files', '--name-status', '-r', '--ignore-submodules', '--'))

And that's it! All the benefits we had from before are here.

Final Comparison

I'll be using functions in the Python version to make it clear what each git call does, and I'm cleaning up the Python version in a few ways that could also be applied to the Bash script, so this is not meant to be a 1:1 comparison. In my defense, Bash users tend to avoid functions and other clean programming practices.

Most of the extra lines come from the Python functions. I've also updated a couple of the git commands to current best practices, and I've avoided using FG for the print commands so that I can control the color and the paging of long output (if you swap print() for & FG, the output matches the Bash script). Here is the script: git-all.py.

Note: You might want to look at the history of that script, as I'll probably update it occasionally as I start using it.

Notice that it is very clear what each part of the cli interface does, and it's easy to add a feature or extend it. The long for loops are nicely abstracted into iterators.

There may be bugs for a few days while I switch to this from my Bash script. Also note that it must be renamed to git-all, with no extension, for git all status etc. to work.

Bonus: Possible improvement: argcomplete support

One last thing: the one drawback of Plumbum compared to argparse comes from an enhancement package for argparse. I think a great addition to the Plumbum library would be argcomplete support. If you've never seen argcomplete, it's a Bash completion extension for argparse. Argcomplete allows you to add validators, and will suggest completions when you press tab on the command line.

Adding support to Plumbum would not be that hard, and probably wouldn't even require argcomplete to be installed. The Argcomplete API requires three things:

  • A # PYTHON_ARGCOMPLETE_OK marker near the top of the script
  • Special output written to a couple of file descriptors (8 and 9, I believe) when the _ARGCOMPLETE environment variable is set, then exiting before .main() is called
  • The ability to predict the next completion

The first one is easy, and wouldn't require anything from Plumbum. The second would be a simple addition: a new method, say cli.Application.argcomplete(self), that could be overridden to remove or customize argcomplete support. The final one is the hard part: predicting the possible completions. If that can be done, support could be added.
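
To make the second point concrete, here is a purely hypothetical sketch; neither the method nor the helper exists in Plumbum, the names are invented, and the fd 8 protocol is only my reading of how argcomplete works:

import os
import sys

from plumbum import cli

class CompletingApp(cli.Application):
    def argcomplete(self):
        """Hypothetical hook: emit completions and exit, argcomplete-style."""
        if '_ARGCOMPLETE' not in os.environ:
            return                                # normal run, nothing to do
        candidates = self.predict_completions()   # the hard part, left abstract here
        with os.fdopen(8, 'w') as out:
            out.write('\x0b'.join(candidates))    # argcomplete's default separator
        sys.exit(0)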

Because support would be built into Plumbum itself, you wouldn't need the monkey patching that argcomplete uses to inject itself into argparse. You would still use the same Bash hooks that argcomplete uses, so it would work alongside it, being called in the same way.
