#
All entries for June 2018

## June 27, 2018

### More randomness

Dealing with randomness and random samples of distributions is a huge area, but we're going to finish these posts by considering one last big thing that we haven't touched on at all : random distributions in multiple variables. These look at the question of what happens if you have a set of related variables where the probability of one having a given value of one variable affects the probability of having a given value in another variable. You do have to be a little bit careful though because not all probability functions in more than one variable have this property. A classical example from physics of one that **doesn't** is the distribution of velocities of particles in a gas.

Particles in a gas are free to move in any direction that they want to, so they have three velocity components since the universe is has 3 spatial dimensions (at least to a good enough approximation for this problem). In fact, the distribution of velocities is given by the Maxwell-Boltzmann distribution, which is basically the Gaussian or Normal distribution that we saw earlier. Mathematically, this looks like

Where Vx, Vy and Vz are the x, y and z components of the particle's velocity, m is the mass of the particle, k is a constant called Boltzmann's constant and T is the temperature in Kelvin (Which is the same as centigrade but offset so that 0C is 273.15K. 0K is "absolute zero" which doesn't exist in nature, so it's fine that this equation breaks at T=0). If you've not seen it before the alpha symbol rather than = means that I'm hiding a constant multiplier in the equation. Using the properties of exponentials we can rewrite this as

Imagine now that you have chosen Vx. That first exponential then just becomes a constant and can be absorbed into the constant that I'm hiding in that alpha symbol so the distribution of Vy and Vz aren't changed by selecting a Vx. That means that I can simply choose Vx, Vy and Vz independently and since they are all Normally distributed I can use the Box-Muller transform that I mentioned in the previous randomness post so I can sample it very fast. Always look at the mathematical description of your distribution to see if you can break it down like this into simpler forms (in more technical language this is termed "seperability" of the equations). It's always faster to sample on a smaller number of variables if at all possible.

But there are some distributions that really can't be separated like this, for example consider

as far as I know this isn't a distribution that anyone is interested in, but it certainly could exist. You can't separate this into functions of Vx, Vy and Vz so you have to sample it in all three variables directly. In the previous post we mentioned inverse transform sampling. This involves forming the cumulative distribution function from the distribution function and sampling along that curve. You can in theory do this for multdimensional functions as well. You have to define a space-filling curve that covers the area of interest of the function and then integrate the PDF along this curve and then finally do inverse sampling along this curve. Doing inverse transform sampling analytically is already tricky and working along an arbitrary curve really doesn't help matters, so in general you have to do it numerically. This means that you have to define your function on a grid of some kind. The problem is that this grid has to have as many dimensions as you have variables, so in this case there's Vx, Vy and Vz so 3D. The more points you have in each dimension of your grid the better your numerical approximation is to the real distribution (you can do things like interpolating between grid points to make it better than the simplest implementation but this is still true overall). So if you want to have 1000 grid points in each direction for this 3D example you'd need between 4 and 8 GB of memory (depending on how you store your numbers, which also affects accuracy). That's possible on a pretty typical computer nowadays, but still a lot of memory. If you had an even harder 4 variable sampling function then this rises to 4-8 TB of memory which is the preserve of high end workstations costing tens of thousands of pounds, and by 6 variable sample functions it couldn't even be run on world leading supercomputers.

Fortunately there is an alternative called rejection sampling. It's much simpler to implement too, although the downside is that it can be much slower to run. You can comparatively easily change any distribution function so that it has a maximum value of 1. Simply find the maximum value in the range of interest of all variables (either analytically or numerically) and divide the distribution function by this value. What you have now is often called the "acceptance function". That is, if I randomly select a value for each variable (selecting this value as a uniform random number in some range of interest), what is the probability that I should accept this value and include it in my sample? This is why the you want a function that has a maximum value of 1 - you want there to be an "optimal" place where acceptance is guaranteed. Then you simply roll one more random value and if it is less than the random probability that you get from the acceptance function then you should take that value, add it to the list of values and continue. If it isn't then you just randomly select new values **of all the variables**and keep going until the value is accepted. When you have accepted enough values for your purpose you just stop sampling.

This works beautifully and doesn't require that you allocate any arrays for your distribution function, so long as you can calculate it for any possible value of your variables. Since there's no array there's no problem with discretising the results and you get a nice smooth sample of your distributions function. The problem is in all of the times that you have to throw away sample points because they are rejected. This means that in general rejection sampling is quite slow. While there's no need to, pretend that you want to sample a Normal distribution in a single variable using rejection sampling. If you want to sample it out to one standard deviation from the mean then the average acceptance probability is 86% (ish), so it'll take about 16% longer to compute than it would to be able to simply specify the sample value directly (this isn't quite true because Box-Muller is quite complex and takes longer than a simple rejection sample, but I'm going to ignore this here). Unfortunately there are very few examples of problems where one standard deviation would be enough. In fact, a normal distribution to 1 standard deviation looks quite odd

so you generally have to go further out. At 2 standard deviations you're now accepting about 60% of trials, so you're taking about 66% longer than simple sampling. At 3 standard deviations you're at 42% average acceptance and you're talking about taking 2.4 times longer than a simple sample. At 4 standard deviations you're now down at 31% average acceptance and you've having to do 3 trials for an average sample until you accept it. Unsurprisingly this takes about 3 times longer than simply being able to specify the value, but finally your Normal curve looks about right by eye. As you can see, rejection sampling really should be your last resort since Normal curves are about as benign as distribution functions get - most functions that you'd actually want to run acceptance sampling on have lower acceptance probabilities by the time that you are sampling everything that you want to.

There are more advanced algorithms that are better than rejection sampling (try looking at the Metropolis-Hastings algorithm if you want a place to start), so you're not completely out of luck if you've reached this point and rejection sampling is too slow, but things do start getting more complex from here on in.

## June 13, 2018

### Fix Me

## Fancy Latex Customisation

Writing up notes and papers, I often find that I want to leave myself reminders, hints and other snippets that I'd rather didn't make it into the final draft. Or I want to collate notes from throughout my text into a file I can put in my appendix, such as a list of future work. I often think of Latex as just a typesetting language, but Tex itself is actually a fully Turing complete language and Latex isn't too far off. The hardest bit of Turing completeness is usually the ability to do conditional branches which Latex can do with a bit of work. Thankfully there's a common package that takes some of the load off us.

* NOTE: *many journals wont accept manuscripts containing custom commands, so don't forget to remove your commands from papers you submit in future.

The Latex snippets are available on our Github page and are described below in more detail.

## Defining Latex Commands

Firstly, we use the Latex \newcommand to do some custom formatting we can easily vary, in this case to make our text bold. \newcommand is fairly simple. The syntax is

\newcommand{\<name>}[<n_args>]{<command>}

where the parts in <> should be replaced with appropriate text. Thist creates a command called "name", taking "n_args" arguments and applies "command". The arguments are referred to using a '#' within your command, as in the following example where we feed the text inside the \fbrtext braces into \textbf.

\newcommand{\fbrtext}[1]{\textbf{#1}}

If there are no arguments, you can omit the [<n_args>] part entirely.

We can define as many commands this way as we like, but they must not be already defined. To redefine something, use \renewcommand (but carefully, making sure not to clobber something important like \section).

## Collating Snippets with Output Streams

The first snippet is to collate some things into a file so we can review things we'd like to fix or add to our document before we release it. Snippet 1 is a simple Latex document which does this. We put the commands in a .tex file named ReleaseFixes, and include this into our main document, so that we can reuse it easily.

The core part of this is a package called "newfile" which allows you to write to files as the Latex compilation runs. This is how things like index and glossary packages work at heart, creating a temporary file that can then be included. This package lets us create "streams" for output, that we can write to, which may be familiar from Java, C++ or other languages. We name our stream "releasefixes", and then open a file to which it should output.

Now we create the main command, which takes one argument, a text chunk. This chunk will appear in our document in **bold red** text, and will also appear in our output file. In the output file, it'll show the line number (in the input .tex file) where the text appears. We could also include the page number in the output pdf if we wanted or other such useful commands.

## Including Snippets into our Document

Perhaps we want to include the snippets we've generated into an appendix, so that we can send the draft to somebody else and they have a handy list of all the changes we want to make, or things we need filling in. Or maybe we want to use a similar command to list the changes we've made to a draft for easy review.

In Snippet2 we adjust our command so that rather than creating a simple text file, we put a '\item' command at the start of every item. Then we add a section to our document, and include the file we generated, inside an \itemize environment. This should give a neat bullet-pointed list of things to fix.

This doesn't quite work yet: the stream seems to be empty at the point where we read it, even though the file is complete when we look afterwards. When you have an unexpectedly empty output file it is usually because the stream didn't get "flushed". Flushing an output stream means moving its contents from a buffer in memory to the actual output file. This could be done every time we write to the stream, but if we do lots of tiny writes it is much slower this way than waiting until we have "enough" and writing in a chunk.

Normally, output flushing happens when the compile ends and everything gets closed, but we want access to our file before this. So we insert an explicit \closeoutputstream in just before our new section to write the file immediately, and then we can open it for input. We include its contents, and they are then typeset just like a static included file.

## Using *ifthen* to Typeset Variants

Suppose that rather than showing us the changes, we want a way to include or exclude them from the output file entirely. This is really easy to do with a quick edit, because we just redefine our \fbr command to omit the text instead, and compile again. Easy for one command, but what if we have several things we want to change for our "draft" mode? It's tedious and error prone to edit them all.

A perfectly good solution for this precise case, by the way, is to create two versions of our "ReleaseFixes.tex" file, one for draft mode, and one for release mode, and edit our file to swap between them

\include{ReleaseFixes_draft.tex} %Swap with next line for release

%\include{ReleaseFixes_release.tex}

However, this is a good demonstration of conditionals, so let's try that. In Snippet3 we use the package "ifthen" which provides an "ifthenelse" command, that looks like

\ifthenelse{<condition>}{<command if true>}{<command else>}

The package allows us to define boolean flags. We define one called usered which chooses whether our text should be output in bold or red-bold.

**However** we can only define flags after we've include the package, and we'd prefer the draft mode selection it in our main document. So we define a normal command, \draft, and check if this is defined instead. We adjust the \fbr command to do nothing in release mode (\draft not defined) or to give us bold text (red if the flag says so) and a file entry in draft mode. Then by un/commenting line 6 of Document.tex we control the final form.

## Summary

Latex typesetting gets a lot easier once you have a few simple commands at hand. It gets a * lot* more powerful once you add the ability to do conditionals. "Clever" commands like this should be used with caution, since anybody who wants to compile the document needs the relevant packages, but they are very handy when you need them, and can save a lot of time and effort. And do remember that you'll probably have to remove anything like this if you submit Latex to journals.