Entries for October 2018, Research Software Engineering at Warwick

October 31, 2018

Changing your Identity

A pretty quick entry for Halloween, that comes up pretty often - if you're accessing several remote machines with `ssh`, how can you have custom "identities", i.e. custom key pairs? There's a few reasons to do this, including history (already set up some keys for different remotes on two different machines) and security (perhaps one set of keys needs to be longer (more bits), or have a better passphrase).

SSH key pairs

First up, what are SSH keys? SSH (secure shell) is a way of getting a command prompt or terminal on a remote machine, such that all communication between the machine you're sat at and the remote is encrypted, and can't be spied on. This uses Public Keycryptography. What this means, is that on your local machine, you have some private key that you keep safe. You give out the public key to any remotes you want to access. With the public half, they cannot pretend to be you, but can verify that you are you. If they use the public half to encrypt some data, it can only be decrypted using the private key you hold.

You generate these key pairs with a command like `ssh-keygen -t rsa -b 4096` which then gives the option of using a custom name for the pair, and adding a passphrase. In general you should always use a secure but memorable passphrase. This is an extra layer of security - even if the private key somehow escapes you, it still cannot be used with out obtaining or cracking this phrase.

Custom Pairs in SSH

Lets assume, for whatever reasons, you have created several named key pairs. Perhaps you need a longer key pair for an extra-secure system, but don't want to have to update every single remote machine you may use, or perhaps some system has asked you to change your keys. Alternately, you may have generated new keys on some other machine, but want to add them as an extra set without overwriting what you already have. Regardless of how or why, you have a key pair called something that is NOT id_rsa, and want to use this to access some system.

The simplest way to do this, and it is really simple, is to use the `-i` or "identity" flag to ssh. You run something like `ssh -i ~/.ssh/custom_rsa username@hostname` where custom_rsa is the private key you wish to use. You can use exactly the same idea to transfer files using scp, which just passed the given filename on to ssh.

Custom keys with ssh-agent

A simpler, but less efficient option is to simply add all the keys you ever use to the ssh-agent on a linux/osx machine, and ssh will try each in turn. This is great if you have one or two, but gets to be a pain if you have many. ssh-agent can be a bit fiddly, and is perhaps overkill for this simpletask, but for more details see here.

Custom Identities with config

If, as well as custom keys you also have a lot of different usernames to deal with (for instance, an SCRTP username, a Github username etc), it can be helpful to set up an ssh config file (~/.ssh/config). This lets you define usernames, Authentication type (password, key etc) and identities for multiple remote machines. It's also handy to create shortcut names, if for instance, you have a very long or complex named remote you access often.

Doing this is really easy in general. Just create a file, "config" in the (hidden) ".ssh" directory. For each machine you want to use, add an entry that looks like

Host shortcutName
  HostName actualNameOrIP
  User userName
  PreferredAuthentications {"publickey" and/or "password"}
  IdentityFile pathToPrivateKey

The Host line is essential, all the rest can be omitted if you don't want them. For instance, I have a section like

Host homeserver
  HostName myname.ddns.com
  User me
  IdentityFile ~/.ssh/id_rsa_server

to use a special key when I use my home server. This uses a web service to link some name to the actual IP address, hence the ddns hostname. I could then ssh to that by doing just `ssh homeserver` and it'll use the given hostname, username, and identity file. The Hostname line can also contain an IP address which is also very useful and saves you having to remember it. E.g.

Host docserver
  Hostname 192.0.0.5
  User me

so that I don't have to always remember the IP.

Custom Identities with Programs wrapping SSH

Finally, several programs use SSH behind the scenes, and you might want these to use a custom identity. For example, git can access a server using SSH, and I have a custom identity which I use for github. I use a config entry where the "host" is github.com: if I push to a github remote repo it will identify that it is using this host and use the corresponding config entry. Or I could use ssh-agent.

There is one more option with things like git, which is to override their actual SSH command. This needs a tiny bit of caution, but is very useful. For git, you simply set the environment variable `GIT_SSH_COMMAND` to be `ssh -i path/to/private/key`. This is useful for custom identities, although perhaps not as good as the other options.

But while we're here, I'll mention when this is very very useful - when you're having key trouble. You can set the GIT_SSH_COMMAND to `ssh -vv` to get verbose ssh output, which is very useful for debugging.

Other programs may have their own analogue - usually an enviroment variable or config option. See their docs for details.

Heather Ratcliffe : 31 Oct 2018 18:04 | Tags: Git Ssh | Comments (0) | Close comments | Report a problem

October 17, 2018

Searching in data

One of the most common things that you'll want to do in programming is to look in a list of items to find out if a given item is in there. The obvious way of doing it is by going through every item in the list of items and comparing it to see if the items are the same. On average, searching for a random item in a random list of items you'll have to look through half of them before you find the item you want (obviously sometimes you'll find your item quicker than this, sometimes you'll have to look through all of them before you find the one that you want. But on average over a very large number of searches you'll compare half of the list on each search) This algorithm is generally called linear searchboth because you go through your items in a line, one after the other, and because the time that it takes is linearin the number of elements. By this I mean that if you double the number of elements in your list it will take twice as long (on average) to find a given element. In the normal notation of algorithms this is called O(n) "order n" scaling (this is one use of so called "Big O notation"). If doubling the number of elements would take four times as long (quadratic scaling) you'd say that your algorithm would scale as O(n²) "order n squared".

In general, there is no faster way of finding a random element in an unordered list than using linear search. On the other hand, if you have an ordered list then you can speed things up by using bisection. The idea is quite simple. Imagine that you have the first seven entries in the International Radiotelephony Spelling Alphabet (IRSA, don't worry it's probably more familiar than it's name!) in order

Alfa
Bravo
Charlie
Delta
Echo
Foxtrot
Golf

and you want to find the item "Bravo". Obviously, by eye this is easy but you want an algorithm that a computer can follow. You know that you have 7 items, so you compare the item that you want "Bravo" with the one at the half way point "Delta". You know that "Bravo" is before "Delta" in the IRSA and because you know that your list is ordered you can then throw away everything from and after "Delta".You now have

Alfa
Bravo
Charlie

,three items. Once again, check the middle one, which this time is "Bravo" so yay! You've found your target in your list, and it only took two tests despite there being seven items. This is because every time you make a comparison you can throw away half of your list, so doubling the length of your list doesn't double the number of operations that you need, it actually only (on average over all possible lists) adds a single additional operation. Technically this algorithm is O(log(n)) "order log n", which makes it very, very useful if you have a large number of items. Imagine that you have 256 items, you'd have to compare all 128 of them (on average) in linear search compared to 8 (on average) in a bisection search. If you go to 4096 items then it gets even more extreme with 2048 comparisions for linear search compared to 12 for bisection. Similarly if your target item was "Foxtrot" and after "Delta" then you could throw away everything before and including "Delta". By alternatively throwing away either the top or the bottom of your list you always get to the item that you want.

As always though, there are problems. First you have to be able to access any element of your list freely. I quite happily said 'compare the item that you want "Bravo" with the one at the half way point "Delta"' without asking how you go from "half way point" to "Delta". While we haven't covered them yet there are some forms of data structure (notably linked lists which we'll cover soon) where you can't easily do this kind of hopping around inside your list and bisection is usually much less efficient or impractical in these kinds of data structure.

Second, this really does onlywork on ordered lists as you can clearly see by the way in which I just threw away half of my list based on the property of the middle item. You might be tempted to say "ah! I can just sort my list" and indeed you can, but even the best algorithm for sorting a list is O(n log(n)) "order n log n", which in general means that it is just a bit slower than a linear search through your data. What this means is that it's only worth sorting your data and then using bisection to search in it if you're searching in your data much more than you are adding to it, so you can sort your list once and then do many searches on it before it becomes unsorted again when you put new data in. This general idea (although not always the details) is the idea behind indexesin databases. You create ordered lists of data that you want to search on a lot so that you can find the data that you want quickly.

Bisection is a very general approach that you find in all sorts of problems, not just finding items in lists. The general idea is the same : throw away half of you items because you know that the item that you want cannot be in that half.

Christopher Brady : 17 Oct 2018 17:45 | Tags: Programming | Comments (0) | Close comments | Report a problem

October 03, 2018

Pretty please print

The possible impossible

A pretty short entry this week, on a likely familiar theme. You have some buggy code. You try and trace the bug by inserting 'print' statements. Eventually you have prints wrapped around the troublesome piece and you may find yourself saying something like:

Aha! It reaches line A, but never line B! But thats IMPOSSIBLE! There's nothing there but simple code that can't just not execute! In fact there's only blank lines!

As usual, once you find yourself uttering a line like that, it's time to step back. Computers only rarely do the impossible.

Many things can have happened here, but a common, very confusing one, is that the second print has run, but you're not seeing the result. This is common because most languages default to bufferredoutput to stdout. We mentioned this a little bit in context of output streams in a previous entry, here we give a bit more detail.

Aside - stdout

If stdout and stderr aren't already familiar, they are the short names for standard out and standard error. There's a little bit about those herebut basically they are the designated way of writing output (information) and errors to the terminal. They're separate because sometimes you only want to see one or the other. You can direct them to separate files if you want (see the link for how). It is very good practice to respect the designation, and use the correct stream when you write information. It's acceptable to write everything to stdout in your code, but you really shouldn't be writing anything except actual errors to stderr. Warnings are acceptable, but only if they're "important". Save it for things that a user really needs to know.

Back to the Print Problem

The conversation goes something like this:

"So, what's going on with my prints? If it's printing but I don't see it, where is it?"

"It's in an output buffer, waiting until there's enough stuff to be worth printing. "

"Hang on, worth printing? I think it's worth printing! Print it!"

However, there is a good reason. The easiest way to show this is with some code. The following Python3 code uses both buffered output and unbuffered output where we force each print statement to flush the buffer so each print happens exactly as it is reached:

import time
_iters = 100000

def many_prints(flush):

  for i in range(0, _iters):
    print(i, flush=flush)

start_time = time.time()
many_prints(False)
timef = time.time() - start_time

start_time = time.time()
many_prints(True)
timet = time.time() - start_time

print("No flushing --- %s seconds ---" % timef)
print("Flushing --- %s seconds ---" % timet)

Run this a few times and you should see that flushing is slower. It's not always by much, but it can be 20%, or more, depending on system, which is sometimes significant.

Why so Slow?

So why is unbuffered output slow? In this case, mainly because interacting with the terminal and redrawing the screen takes time. Writing to a file can be even slower, because now we have to interact with the file system and disks, and maybe even wait for a success signal to come back. It's much better if these events can be collected into larger bursts, so we only have to do that rigamarole once.

Unbuffered Output and Flushing Options

We showed Python3 above, where we can simply use the flush Keyword. In Python 2 we have to do a little more. This linkshows the main options. We can force the Python interpreter to not use buffering at all, or we can override it for given output streams. Or we can write custom objects that flush exactly when we want.

In Fortran, we just call the FLUSH subroutine either on all outputs or the unit of our choice.

In C, we can flush stdout with fflush, set stdout to be unbuffered, or use stderr which is unbuffered by default.

In C++ there is a slight wrinkle as we can use either C-style IO via stdio, or C++ streamio. std::endl is a command to flush the output stream, not just a new-line, so it's usually recommended to use `'\n' to get just a newline, and std::endl only when required.

C++ IO

While we're here, we should mention on thing about mixed IO in C++. By default, a program can mix stdio via printf and stream-io (using <<), and similarly for input. BUT these two methods have to be force to share any buffering, or else the would not work together. This means they are forced to stay in sync, so inputs and outputs work as expected. In some circumstances, especially if a lot is written to stdout, it may be useful to only use one or other of these, and disable the syncing. See here for details, but be very very careful.

Bottom Line?

The bottom line of all of this is pretty simple. When doing print-style debugging, or printing error messages, either use stderr (making sure that's not buffered in your language or system), or flush after every essential stage of your prints, or disable buffering completely while debugging. You don't need all three. And finally, remember the golden rule. If it looks impossible, you're probably assuming something that isn't true. Not terribly snappy, but a useful thing to remember.

Heather Ratcliffe : 03 Oct 2018 21:16 | Tags: Code Debugging Io | Comments (0) | Close comments | Report a problem

October 02, 2018

Upcoming Training Opportunities

Warwick RSE's autummn term training is now available for signup for any University of Warwick Staff or students, and anybody from the HPC-Midlands-Plus consortium.

This time we have two options.

The first is aimed mainly at Warwick Researchers who wish to use HPC facilities. We'll go through getting access and some essential info you'll want to know, as well as briefly mention where else you can get computing resources.

Secondly, we have a short 3-hour seminar going over all the bits of Software Development researchers should know about. This will be a pretty rapid spin through a lot of tools and words you'll need to know. Hopefully, you'll then spot when you should go and learn more about these things as they come up in your research etc.

For dates, signup etc, see our calendar

Heather Ratcliffe : 02 Oct 2018 17:41 | Tags: Hpc Programming Training | Comments (0) | Close comments | Report a problem

Mo	Tu	We	Th	Fr	Sa	Su
Sep		\| Today \|			Nov
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30	31

All entries for October 2018