November 29, 2006

A Short Walk in the Blogistan

The paper discussed in presentation number 3 is quite general in its subject matter. The term ‘Blogistan’ simply refers to the entire collection of blogs in existence. The paper’s abstract states that it will explore three aspects of the Blogistan:

  • Its overall scope and size
  • Identification of hot topics of discussion and link patterns
  • Implications for both blogs and applications such as search

Gathering Data

The authors of the paper used data from websites that rate the popularity of other sites. All URLs containing the word ‘blog’ were taken, and then duplicates and obvious non-blog URLs were removed. Blogs that had not changed within the preceding few weeks were also removed. The remaining blogs were fetched 5 times a day for one month.
In addition to the main body of information contained in the blogs, meta information was also gathered. In the second phase of URL gathering, as opposed to the first, links were also extracted from the blogs to see how blog collections differ from web sites.
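
The first filtering step described above can be sketched roughly as follows. This is only an illustration, not the authors' actual procedure: the function name candidate_blog_urls, the non_blog_hosts parameter, and the trailing-slash normalization are all assumptions made for the example, and the example.* hosts are placeholders.

```python
from urllib.parse import urlsplit


def candidate_blog_urls(urls, non_blog_hosts=frozenset()):
    """Keep URLs containing 'blog', dropping duplicates and known non-blog hosts.

    Hypothetical helper: the paper does not say how duplicates or
    "obvious non-blog URLs" were detected, so both rules here are guesses.
    """
    seen = set()
    kept = []
    for url in urls:
        # Assumed normalization: trim whitespace and a trailing slash.
        key = url.strip().rstrip("/")
        if "blog" not in key.lower():
            continue
        if urlsplit(key).netloc.lower() in non_blog_hosts:
            continue
        if key in seen:
            continue
        seen.add(key)
        kept.append(key)
    return kept


urls = [
    "http://example.com/blog/",
    "http://example.com/blog",
    "http://news.example.org/",
    "http://blogads.example.net/x",
]
filtered = candidate_blog_urls(urls, non_blog_hosts={"blogads.example.net"})
```

A real pipeline would also need the staleness check (dropping blogs unchanged for a few weeks), which requires fetching each page and is left out here.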

The paper uses the terms ‘new URL’ and ‘old URL’. A ‘new URL’ is one that is referenced in any blog under examination at least 24 hours after the start of data gathering. All other URLs are deemed to be old. Only new URLs were considered when looking for emerging interests, and of those, only URLs deemed to be ‘interesting’ were extracted. An ‘interesting URL’ is one that has a relatively large number of references. The number of references to a URL is called its multiplicity.
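
To make these definitions concrete, here is a minimal sketch of how new, interesting URLs might be selected. It rests on two assumptions that are mine rather than the paper's: a URL counts as new when its earliest observed reference falls at least 24 hours after the start, and ‘interesting’ means a multiplicity at or above a made-up threshold (min_multiplicity).

```python
from collections import Counter
from datetime import datetime, timedelta


def interesting_new_urls(references, start, min_multiplicity=3):
    """Return {url: multiplicity} for new URLs with enough references.

    references: iterable of (url, seen_at) pairs, one pair per reference.
    min_multiplicity is a hypothetical cut-off, not a value from the paper.
    """
    cutoff = start + timedelta(hours=24)
    first_seen = {}
    multiplicity = Counter()
    for url, seen_at in references:
        multiplicity[url] += 1  # multiplicity = number of references
        if url not in first_seen or seen_at < first_seen[url]:
            first_seen[url] = seen_at
    return {
        url: count
        for url, count in multiplicity.items()
        if first_seen[url] >= cutoff and count >= min_multiplicity
    }


start = datetime(2006, 1, 1)
refs = [
    ("a", start + timedelta(days=2)),
    ("a", start + timedelta(days=2, hours=1)),
    ("a", start + timedelta(days=3)),
    ("b", start + timedelta(hours=1)),   # seen in the first 24 hours: old
    ("b", start + timedelta(days=3)),
    ("b", start + timedelta(days=4)),
    ("c", start + timedelta(days=2)),    # new, but referenced only once
]
result = interesting_new_urls(refs, start)
```

Here only "a" survives: "b" has enough references but was already seen on the first day, and "c" is new but has a multiplicity of 1.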

A large part of the paper is dedicated to data gathering, without drawing any conclusions. The authors seem interested in what data should be considered relevant within the Blogistan. Thus, they talk about only examining blogs with high multiplicity and other such things.

Inferences

Of those blogs examined in the first phase, 33.5% had not been updated in 2 months, perhaps suggesting that a fairly large fraction of current blogs have actually died.
Unlike web sites, millions of blogs are distributed across relatively few hosting domains. The data gathered showed around 180,000 domains, but because of aliases only 11,870 IP addresses are associated with them. This is a surprisingly small number considering the huge number of active blogs that exist.
It is noted that: popular websites have more references than blogs; blogs have more references than less popular sites; and blogs have more self-references than websites, which is perhaps unsurprising.
A discussion is given of server issues that may be faced with regard to blogs. The importance of the HTTP Range request is emphasised. This header lets a client ask for just a portion of a resource, so that only new data needs to be retrieved. Range requests should therefore reduce the traffic associated with popular blogs. However, the data collected showed that only about 40% of blog servers are able to handle range requests.
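
As a rough illustration of the idea, a client that already holds the first part of a page could ask only for the bytes after it. The sketch below uses Python's standard urllib; the helper names and the example URL are hypothetical, and it assumes the resource only grows by appending, which will not hold for every blog. A server that honours the Range header answers with status 206 (Partial Content), while one that ignores it sends the whole body with status 200.

```python
from urllib.request import Request


def build_range_request(url, offset):
    """Build a GET request for everything from byte `offset` onwards."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return Request(url, headers={"Range": f"bytes={offset}-"})


def merge_response(cached, status, body):
    """Combine cached bytes with a response body.

    Assumption for this sketch: the resource is append-only, so a 206
    body can simply be added to the end of what we already have.
    """
    if status == 206:   # server honoured the range: body is just the new tail
        return cached + body
    if status == 200:   # server ignored the range: body is the full resource
        return body
    raise ValueError(f"unexpected status {status}")


# Hypothetical feed URL; no network request is made in this example.
req = build_range_request("http://example.com/feed.xml", 1024)
```

The request object would then be passed to urllib.request.urlopen, and the response status and body handed to merge_response.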
Having discussed the above, the presenters talked about the spamming problems with blogs. In addition to blogs themselves being victims of spam, a phenomenon known as ‘splogs’ has also emerged. These are fake blog pages generated with arbitrary content. However, systems are emerging that can detect splogs around 90% of the time.

Although the paper itself does not seem to mention this, the presentation introduced the idea of three types of key bloggers: summarizers, agitators and topic-finders. Summarizers link to lots of other blogs and web sites. Agitators are those who create drastic changes in the topics within a thread. A definition of topic-finders is hard to locate, but they are presumably those who post based entirely upon certain topics of interest.

This paper is quite hard to get a grasp of due to its large number of data references. If anyone has anything to add, please let me know; this one’s quite hard to summarize.

