All entries for Wednesday 17 October 2018
October 17, 2018
Searching in data
One of the most common things that you'll want to do in programming is to look in a list of items to find out if a given item is in there. The obvious way of doing it is by going through every item in the list of items and comparing it to see if the items are the same. On average, searching for a random item in a random list of items you'll have to look through half of them before you find the item you want (obviously sometimes you'll find your item quicker than this, sometimes you'll have to look through all of them before you find the one that you want. But on average over a very large number of searches you'll compare half of the list on each search) This algorithm is generally called linear searchboth because you go through your items in a line, one after the other, and because the time that it takes is linearin the number of elements. By this I mean that if you double the number of elements in your list it will take twice as long (on average) to find a given element. In the normal notation of algorithms this is called O(n) "order n" scaling (this is one use of so called "Big O notation"). If doubling the number of elements would take four times as long (quadratic scaling) you'd say that your algorithm would scale as O(n2) "order n squared".
In general, there is no faster way of finding a random element in an unordered list than using linear search. On the other hand, if you have an ordered list then you can speed things up by using bisection. The idea is quite simple. Imagine that you have the first seven entries in the International Radiotelephony Spelling Alphabet (IRSA, don't worry it's probably more familiar than it's name!) in order
and you want to find the item "Bravo". Obviously, by eye this is easy but you want an algorithm that a computer can follow. You know that you have 7 items, so you compare the item that you want "Bravo" with the one at the half way point "Delta". You know that "Bravo" is before "Delta" in the IRSA and because you know that your list is ordered you can then throw away everything from and after "Delta".You now have
,three items. Once again, check the middle one, which this time is "Bravo" so yay! You've found your target in your list, and it only took two tests despite there being seven items. This is because every time you make a comparison you can throw away half of your list, so doubling the length of your list doesn't double the number of operations that you need, it actually only (on average over all possible lists) adds a single additional operation. Technically this algorithm is O(log(n)) "order log n", which makes it very, very useful if you have a large number of items. Imagine that you have 256 items, you'd have to compare all 128 of them (on average) in linear search compared to 8 (on average) in a bisection search. If you go to 4096 items then it gets even more extreme with 2048 comparisions for linear search compared to 12 for bisection. Similarly if your target item was "Foxtrot" and after "Delta" then you could throw away everything before and including "Delta". By alternatively throwing away either the top or the bottom of your list you always get to the item that you want.
As always though, there are problems. First you have to be able to access any element of your list freely. I quite happily said 'compare the item that you want "Bravo" with the one at the half way point "Delta"' without asking how you go from "half way point" to "Delta". While we haven't covered them yet there are some forms of data structure (notably linked lists which we'll cover soon) where you can't easily do this kind of hopping around inside your list and bisection is usually much less efficient or impractical in these kinds of data structure.
Second, this really does onlywork on ordered lists as you can clearly see by the way in which I just threw away half of my list based on the property of the middle item. You might be tempted to say "ah! I can just sort my list" and indeed you can, but even the best algorithm for sorting a list is O(n log(n)) "order n log n", which in general means that it is just a bit slower than a linear search through your data. What this means is that it's only worth sorting your data and then using bisection to search in it if you're searching in your data much more than you are adding to it, so you can sort your list once and then do many searches on it before it becomes unsorted again when you put new data in. This general idea (although not always the details) is the idea behind indexesin databases. You create ordered lists of data that you want to search on a lot so that you can find the data that you want quickly.
Bisection is a very general approach that you find in all sorts of problems, not just finding items in lists. The general idea is the same : throw away half of you items because you know that the item that you want cannot be in that half.