Tracking Information Epidemics In Blogspace
Today’s presentation concerned the tracking of specific data through blogspace using the analogy of infections. The idea of ‘memes’ was introduced, which are essentially units of information passed from one mind to another. An ‘infected’ blog is therefore one that contains a specific meme, such as a URL. The example given in the paper involves an attempt to locate the source of the URL www.giantmicrobes.com.
There are certain problems associated with attempting to track URLs, or memes, through blogspace. Firstly, not every infected blog will have seen the original source, and it is difficult to find where a URL may have originated for any particular blog.
One helpful blog attribute to aid meme tracking is the blogroll. This is simply a collection of links to other weblogs, usually presented on the front sidebar of a blog, as seen to the left of these entries. Another possibility is that the information is given an associated link to its source, known as a ‘via’ link. This is very rare, however.
Blog users can specify ‘trackbacks’ when making a blog entry. A trackback is a link to the blog that the entry refers to; that is, a link to the local source of the information. Trackbacks can create bidirectional links between blogs: if I create a trackback link in this entry referring to Blog B, then a link will automatically be produced at Blog B pointing back to this blog.
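To make the bidirectional aspect concrete, here is a toy sketch of how a trackback could produce links in both directions. This is purely illustrative and my own construction (real trackbacks are an HTTP ping protocol between blog servers, not in-memory objects):

```python
# Toy model of trackback linking: a trackback from A's entry to B
# automatically gives B a link back to A (hypothetical Blog class).
class Blog:
    def __init__(self, name):
        self.name = name
        self.links_out = set()   # blogs this blog links to
        self.links_in = set()    # blogs that link back here

def trackback(source, target):
    """source's entry refers to target; target automatically
    gains a link back to source."""
    source.links_out.add(target.name)
    target.links_in.add(source.name)

a, b = Blog("A"), Blog("B")
trackback(a, b)
print(a.links_out, b.links_in)  # {'B'} {'A'}
```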
Unfortunately, links are not always easy to infer using these methods and so classifiers may be used.
The presentation contained descriptions of methods for inferring infection routes. This summary will not include the equations given therein. The first of these methods looked at the similarity between URLs published on blogs. This similarity function used the number of URLs on blogs A and B, and the number of URLs shared between them: the similarity was computed as the number of shared URLs divided by the product of the square root of the number of URLs in blog A and the square root of those in blog B. It was shown that linked blogs commonly share a number of URLs.
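As I understand it, this is a cosine-style overlap measure on sets of URLs. A minimal sketch of my reading of it (my own reconstruction, not the authors’ code; the blog URL sets here are invented):

```python
# URL-overlap similarity: shared URLs divided by sqrt(|A|) * sqrt(|B|).
import math

def url_similarity(urls_a, urls_b):
    """Return the cosine-style similarity of two sets of cited URLs."""
    if not urls_a or not urls_b:
        return 0.0
    shared = len(urls_a & urls_b)
    return shared / (math.sqrt(len(urls_a)) * math.sqrt(len(urls_b)))

blog_a = {"www.giantmicrobes.com", "example.com/1", "example.com/2"}
blog_b = {"www.giantmicrobes.com", "example.com/2", "other.net"}
print(url_similarity(blog_a, blog_b))  # 2 / (sqrt(3) * sqrt(3)) ≈ 0.667
```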
Another method is to look at textual similarity between blogs. This is done by analysing common words in relation to the size of the text being analysed. Obviously, if the same number of common words occurs in a smaller amount of text, then the two are more similar.
Also, the times at which blog entries are made can be observed, so connections can be drawn between timings and infection sources. If blog A consistently cites a URL before blog B, then it’s fair to assume that A was the source of infection for B.
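The timing heuristic can be sketched as follows. This is my own simplification (the timestamps and citation records are hypothetical, and the paper uses timing as one signal among several rather than a hard rule):

```python
# Infer a likely infection direction from citation timestamps:
# A is a plausible source for B if A cited every shared URL first.
from datetime import datetime

def likely_source(citations_a, citations_b):
    """citations_* map URL -> the first time that blog cited it.
    Return True if A cited every shared URL before B did."""
    shared = citations_a.keys() & citations_b.keys()
    return bool(shared) and all(
        citations_a[u] < citations_b[u] for u in shared
    )

a = {"www.giantmicrobes.com": datetime(2005, 3, 1)}
b = {"www.giantmicrobes.com": datetime(2005, 3, 4)}
print(likely_source(a, b))  # True: A cited the URL first
```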
These methods can be used to build classifiers to (supposedly) automatically detect infection routes. Two classifiers were mentioned: one which distinguishes between bidirectional links, unidirectional links and unlinked pairs; and another which just distinguishes between linked and unlinked pairs. The latter of these was deemed more accurate.
Also mentioned in this presentation were visualisations of the infection routes via directed acyclic graphs. The paper displays some of these graphs showing the spread of information through a small area of blogspace.
The presentation concluded with a look at some of the problems associated with this work. The most significant of these seemed to be that the classifiers can only examine a very small area of blogspace. Although the amount of information may seem large, the ‘blogosphere’ is extremely vast in comparison. The incompleteness of the crawls can also be questioned. In addition to this, the robustness of the classifiers was emphasised to be disputable.
However, the work was deemed to have produced some useful visualisation tools and provided a unique way of using infection properties to analyse the spread of information. The authors also claim that the work could be used by search engines to locate information sources, but this seems dubious.
Please comment on any inaccuracies or additions you wish to make, because I’m sure this is quite incomplete.