December 27, 2006

Discovering Important Bloggers Based on a Blog Thread Analysis

The paper on which this presentation was based sets out to capture hot topics of conversation by identifying those bloggers who take a particularly important role in these conversations. To do this, two types of blog-user are defined. Agitators are defined as those users who stimulate discussions and Summarizers are those who simply summarise discussions.
The paper discusses, in a fair amount of detail, the types of hyperlinks created between different blogs and between blogs and websites. It also attempts to define aspects of blogs formally. For example, a blog site itself is defined in the document as follows:

site = (siteURL,RSS, blogger+, siteName, entry+)

This simply says that a blog site consists of a site URL, RSS (Really Simple Syndication), an associated set of bloggers who maintain the blog, a site name and a set of entries. RSS is a set of web feed formats used to distribute frequently updated web content. Programs called feed ‘aggregators’ can be used to subscribe to RSS information so that the user is informed when this information is updated.
More importantly to the content of the paper, the following definitions are given for types of links that can be included in a blog entry:

replyLink = (ei, ej), (ei → ej )
trackbackLink = (ei, ej), (ei → ej )
sourceLink = (ei, wi), (ei → wi)

Where ei, ej ∈ E , E is a set of blog entries, and wi ∈ W , W is a set of Web pages except blog entries.
Reply links and source links are non-automatic inclusions and point to relevant blog entries or non-blog web pages respectively. A trackback link is a special type of reply link which also includes a link (ej → ei ) in the other direction. These were also described in earlier presentations.
The paper also introduces ‘blog threads’. A blog thread is simply a directed graph with a set of nodes representing blogs and websites and edges representing source links and reply links. The system described involves first extracting blog threads from blog data and then identifying important bloggers from the threads.
Initially, the system crawls through OPML files to examine different RSS feeds (blog sites). These feeds are added to an RSS list which is then crawled through by the system, extracting the site names, URLs and other relevant information related to the RSS feeds (shown in the formal definition above).
Next, the hyperlinks are extracted from the blog sites by examining the HTML. The HTML tag structure may be different for each blog server so this must be determined first. The hyperlinks are then split into source links and reply links to create blog threads.
Following this, the system must detect important bloggers from blog threads. If the system judges that threads generally grow after a particular blogger creates an entry, then that blogger is deemed to be an Agitator. If the system judges that a blogger creates entries that generally refer to other entries then that blogger is deemed to be a Summarizer.
The paper goes on to describe three ways for discriminating a blog entry ex to characterize the associated blogger as an Agitator.
The first of these is a link-based discriminant. If kx is the number of entries in threadi with a reply link to ex then ex is an entry by an Agitator if: (kx) > θ1
The second method is a popularity-based discriminant which identifies ex as an Agitator entry if the number of entries published in threadi increases a certain amount after the publication of ex. This is determined using the following discriminant:

(lx/mx) > θ2

Where lx is the number of entries in threadi published within a certain time frame after the publication of ex and mx is the number published beforehand.

The final method for identifying Agitators is a topic-based method. This method looks for a change in topic by computing a feature vector for each blog entry from the term frequency values of its description. Similarity between feature vectors therefore indicates similarity in the text of two entry descriptions. The final discriminant features characters that I can’t publish in this blog entry but can be found in the original text from the module website.
Only a link-based discriminant is given for identifying Summarizers. If px is the number of entries in threadi that have a reply link from ex then ex is a Summarizer entry when (px) > θ4.
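To make the link-based and popularity-based discriminants a bit more concrete, here is a rough sketch of how they might be computed. The `Entry` layout, the function names, the symmetric time window used for lx and mx, and every threshold value are my own assumptions for illustration and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    entry_id: str
    published: float                                # publication time, e.g. in hours
    replies_to: set = field(default_factory=set)    # ids of entries this one reply-links to


def k(thread, ex):
    """k_x: number of entries in the thread with a reply link to ex."""
    return sum(1 for e in thread if ex.entry_id in e.replies_to)


def p(thread, ex):
    """p_x: number of entries in the thread that have a reply link from ex."""
    ids = {e.entry_id for e in thread}
    return len(ex.replies_to & ids)


def popularity_ratio(thread, ex, window):
    """l_x / m_x, counting entries within `window` after and before ex (assumed window)."""
    lx = sum(1 for e in thread if ex.published < e.published <= ex.published + window)
    mx = sum(1 for e in thread if ex.published - window <= e.published < ex.published)
    if mx == 0:
        # Nothing was published beforehand, so any growth counts as unbounded.
        return float("inf") if lx > 0 else 0.0
    return lx / mx


def is_agitator_by_links(thread, ex, theta1):
    return k(thread, ex) > theta1


def is_agitator_by_popularity(thread, ex, window, theta2):
    return popularity_ratio(thread, ex, window) > theta2


def is_summarizer(thread, ex, theta4):
    return p(thread, ex) > theta4


# A made-up thread: b, c and d all reply to a, and d also replies to b and c.
a = Entry("a", 0.0)
b = Entry("b", 1.0, {"a"})
c = Entry("c", 2.0, {"a"})
d = Entry("d", 3.0, {"a", "b", "c"})
thread = [a, b, c, d]
```

With these made-up values, entry a looks like an Agitator entry for any θ1 below 3, and entry d looks like a Summarizer entry for any θ4 below 3.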

It seems that the authors of the paper had some interesting ideas for discovering important topics in blogspace and such information could be useful for certain purposes. For example, media companies could use these methods to find what topics are most important to people at a given time and choose to cover these topics in more detail. The methods could also be useful for blog search engines, to identify topics of interest.
However, the presentation group noted that the paper covers a small sample base, and the methods may be very impractical for huge blogspaces. Also, the context of blog entries is not considered, nor are comments associated with these entries. Even so, the ideas here could be developed to be very effective for certain applications.

November 30, 2006

Mapping the Blogosphere in America

This presentation was largely about finding geographical information related to blogs and the topics they cover. The paper that was discussed is the beginning of a long-term research plan to investigate localized attitudes, political agendas, urban mentalities and other such social information in American cities. In order to do this it is necessary to extract geographical information about blogs to see where the authors are based. So how can this be done? Some methods were presented to the class:

  • The registrant’s address may be located in the hosting domain’s registry.
  • The location of the owner may be given in their profile, which can be created along with a blog.
  • It may be that the owner has registered their blog with a local host or website. For example, is a blog host for people in the city of New York.
  • The owner of the blog may have published a link to their CV or Biography which contains their address.
  • There may also be links given to local weather information, schools or other geographically relevant sites.

It was mentioned that the authors of the paper are working on automatic methods for extracting geographical data from blogs. The methods the algorithm uses are as follows:

  • Find GeoURL metadata if it exists – This is when data about the location has been embedded into the HTML of a site but is not visible.
  • Whois query
  • Profile Information – As aforementioned, this involves checking the registrant’s profile for geographical information.
  • Blog Chalking – This is a way of categorising blogs based on interest, regional information and other such things. It is also done to register blogs with major search engines. Described as a ‘tattoo for your blog’ it can be used to find the geographical location of a blog owner.
  • Text on index page – The text on the index page may contain references to local areas and landmarks.

The next issue raised is that of standardizing geolocation data. The information gathered from blogs may vary drastically in accuracy, with some down to a 9-digit zip code and others just a city name. The proposition made is to use the first 3 digits of the zip code to identify geolocations. The zip codes work in the following way: the first digit represents a general area of the US with 0 being the north-east and 9 being the west. Subsequent digits divide the area right down to the nearest local post office. The first three digits will divide the area down into metropolitan areas such as Los Angeles or a cluster of small towns and villages.
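The 3-digit standardization is easy to picture in code. This is only a sketch under my own assumptions: the function names are made up, and I treat anything that is not a 5-digit or 9-digit code as unusable, which the paper may handle differently.

```python
import re
from collections import Counter


def zip3(zip_code):
    """Return the first three digits of a 5- or 9-digit US zip code, else None."""
    digits = re.sub(r"\D", "", zip_code)   # drop hyphens, spaces and letters
    if len(digits) not in (5, 9):
        return None
    return digits[:3]


def bloggers_per_region(zip_codes):
    """Count bloggers per 3-digit region, skipping values that cannot be standardized."""
    return Counter(z for z in map(zip3, zip_codes) if z is not None)
```

For instance, a blogger who gives a full 9-digit code and one who gives only the 5-digit code for a neighbouring area end up in the same 3-digit bucket, while a blogger who gives only a city name is left out until the name is geocoded some other way.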
The results obtained show, as expected, that the number of bloggers is proportional to population and areas of high socio-economic status. The paper does not give a relationship between topics and areas since this particular document is just a starting point for a long-term investigation.

So what are the limitations of the information found in this paper? Some were presented as below.

  • People give inaccurate or false information in their profiles – This may not be deliberate. For example, I may say that I live in Birmingham when in fact I live about 20 minutes away.
  • Using 3-digit codes overstates the number of bloggers in metropolitan areas.
  • 3-digit codes group small towns into one unit although these areas may have no social cohesion whatsoever.

The suggestion is therefore made to divide the country into areas based on socio-economic profile. Thus, areas might be separated based on average household income.

The presentation then went on to discuss topics presented in the paper in more detail. This began with the notion of a Geographic Information System (GIS) which is a system for storing and displaying geographical information. The two issues associated with gathering information for a GIS were given as:

  • Geoparsing – Identifying geographical information from text.
  • Geocoding – Converting this information into geographical coordinates.

Some uses of finding the geolocations of bloggers were given to be:

  • Finding sociological and political trends – Mapping the ‘buzz’ of what is going on in certain areas.
  • For advertising – An example was given of a comparison between locations of a specific restaurant chain and locations of bloggers who mention that chain in their blogs. Thus the marketing department may wish to launch more advertising campaigns in those areas that are not discussing their restaurant.

Other mapping methods were then discussed, not necessarily in relation to geographical locations. One of these was the mapping of hyperlinks in hyperbolic space. This allows the investigation of outgoing and incoming links to websites and blogs via a visual representation.
Self-organising maps were also discussed. These maps, unlike those in hyperbolic space, are 2 dimensional and are built based on common links between blogs. This allows communities to be observed by looking at groups of blogs that link together.

November 29, 2006

A Short Walk in the Blogistan

The paper for discussion in presentation number 3 is quite generalized in terms of its subject matter. The term ‘Blogistan’ simply refers to the entire collection of blogs that exists. The paper’s abstract states that it will explore three aspects of the Blogistan:

  • Its overall scope and size
  • Identification of hot topics of discussion and link patterns
  • Implications both to blogs and applications such as search

Gathering Data

The authors of the paper used data from websites that rated popularity of other sites. All URLs containing the word ‘blog’ were taken and then duplicates and obvious non-blog URLs were removed. Also, blogs that had not changed within the preceding few weeks were removed. Blogs were fetched 5 times a day for one month.
In addition to the main body of information contained in the blogs, meta information was also gathered. In the second phase of URL gathering, as opposed to the first, links were also extracted from the blogs to see how blog collections differ from web sites.

The paper uses the terms ‘new URL’ and ‘old URL’. A ‘new URL’ is one that is referenced in any blog under examination at least 24 hours past the start date of data gathering. Other URLs are deemed to be old. Only new URLs were considered for emerging interests. Also, only URLs deemed to be ‘interesting’ were extracted. An ‘interesting URL’ is one that has a relatively large number of references. The number of references to a URL is called multiplicity.
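As a rough sketch of how these definitions fit together, the filter below keeps only URLs that are both ‘new’ and ‘interesting’. I am assuming that ‘new’ means the first observed reference is at least 24 hours after the start of data gathering, and that ‘interesting’ means the multiplicity reaches some cut-off; the function name, the input layout and the cut-off are mine rather than the paper’s.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def interesting_new_urls(references, start, min_multiplicity):
    """Return {url: multiplicity} for URLs that are both new and interesting.

    references is an iterable of (url, seen_at) pairs, one per reference found in a blog.
    """
    first_seen = {}
    multiplicity = defaultdict(int)
    for url, seen_at in references:
        multiplicity[url] += 1
        if url not in first_seen or seen_at < first_seen[url]:
            first_seen[url] = seen_at

    cutoff = start + timedelta(hours=24)   # earlier first references make a URL 'old'
    return {
        url: count
        for url, count in multiplicity.items()
        if first_seen[url] >= cutoff and count >= min_multiplicity
    }
```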

A large amount of the paper is dedicated to data gathering without any conclusions. The authors seem interested in what data should be considered relevant within the Blogistan. Thus, they talk about only examining blogs with high multiplicity and other such things.


Of those blogs examined in the first phase, 33.5% had not been updated in 2 months, perhaps suggesting that a fairly large fraction of current blogs have actually died.
Unlike with web sites, millions of blogs are distributed across relatively few hosting domains. The data gathered showed around 180,000 domains, but only 11,870 IP addresses are associated with these, due to aliases. This is a surprisingly small number considering the huge number of active blogs that exist.
It is noted that: popular websites have more references than blogs; blogs have more references than less popular sites; and blogs have more self references than websites, which is perhaps unsurprising.
A discussion is given of server issues that may be faced with regards to blogs. The importance of the HTTP Range request is emphasised. This header allows a client to request just a portion of the data, so that only new data need be retrieved. Range requests should therefore reduce the traffic associated with popular blogs. However, the data collected showed that only about 40% of blog servers are able to handle range requests.
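For anyone who has not seen a range request, here is a small sketch of what a client might send. The URL is a placeholder and the helper names are my own; the only things I am relying on are the standard `Range: bytes=N-` header form and the 206 Partial Content status that a server returns when it honours the range.

```python
import urllib.request


def build_range_request(url, bytes_already_fetched):
    """Build a GET request asking only for the bytes from the given offset onwards."""
    return urllib.request.Request(
        url,
        headers={"Range": f"bytes={bytes_already_fetched}-"},
    )


def server_honoured_range(status_code):
    """206 means a partial body was sent; 200 means the server sent everything again."""
    return status_code == 206


# Placeholder URL, for illustration only. Nothing is sent until urlopen() is called.
request = build_range_request("http://example.com/blog/index.xml", 1024)
```

A server that cannot handle range requests would typically just answer 200 with the full page, which is the extra traffic the paper is worried about.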
Having discussed the above, the presenters talked about the spamming problems with blogs. In addition to blogs themselves being victims of spam, phenomena known as ‘splogs’ have also emerged. These are fake blog pages generated with arbitrary content on them. However, systems are emerging that can detect splogs up to around 90% of the time.

Although the paper itself does not seem to mention this, the presentation mentioned the idea of there being three types of key bloggers: summarizers, agitators and topic-finders. Summarizers link to lots of other blogs and web sites. Agitators are those who create drastic changes in the topics within a thread. A definition of topic finders is hard to locate but they are presumably those who post based entirely upon certain topics of interest.

This paper is quite hard to get a grasp of due to its large number of data references. If anyone has anything to add please let me know, this one’s quite hard to summarize.

A Matter of Life and Death – Modelling Blog Mortality

This presentation, which took place on Friday November 24th, was to do with the reasoning behind the death of blogs. Definitions for “death” in this context seem to vary but the term generally refers to a blog that has received its last entry and is now dormant, or has been removed by the provider. Initially, reasons behind starting a blog were examined:

  • Creative Expression – Some people may start a blog to display their poetry, art or other such emotive creations.
  • Journal – A personal record of someone’s experiences. This may be entirely private, which is an option given for each blog entry made.
  • Communication Between Friends and Family – In this sense, a blog may be used as a private forum where family photos may be shared for example.
  • Make Money – A blog may be used to display product information and act as a retail tool.
  • Meet New People – Due to the way that blogs can be interconnected, via blogrolls for example, and the commenting systems therein, it is easy to meet new people and discuss certain topics with them. Looking around Warwick blogs, communities of people have clearly developed purely around this system.
  • Income – Blogs can actually be used to generate income. For example through the use of banner ads.

The presenters mentioned LiveJournal as a source for blog data.
Within this paper, an expression is given for the number of active blogs, x, on any given day:

t is the number of active blogs the day before.
m is the chance of a blog not surviving the night.
d = 1-m is the chance of a blog surviving.
n is the number of new blogs created on the given day.
x = dt + n
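The expression is simple enough to try out directly. In this sketch the function names and the sample numbers are mine, and I am assuming m and n stay the same every day, which the paper may not.

```python
def active_blogs(t, m, n):
    """x = d*t + n, where d = 1 - m is the chance of a blog surviving the night."""
    d = 1 - m
    return d * t + n


def simulate(t0, m, n, days):
    """Apply the daily update repeatedly, returning the count for each day."""
    counts = [t0]
    for _ in range(days):
        counts.append(active_blogs(counts[-1], m, n))
    return counts
```

One thing this makes visible: if m and n really were constant, setting x = t gives x = n/m, so the number of active blogs would settle at that level. With m = 0.25 and n = 50, for example, a population of 200 active blogs stays at 200.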

It was noted by those presenting that the paper is based on a lot of generalisations and the accuracy of any conclusions is questionable.

The next topic presented was that of blog deaths and the reasons behind them. Examples were given to be:

  • Lack of time to maintain – The blog owner has too much work to do (perhaps due to getting a new job) or family life takes over due to the birth of a new baby for example.
  • Lack of results – No one reads the blog or comments on entries and so the owner does not see any point in carrying on (with the blog).
  • Writer’s block – The blog owner runs out of things to write about and decides to give up.
  • Rhythm break – If a writer has a certain rhythm to his posts, perhaps because he always posts at a certain time of day, the readers of his blog will be accustomed to checking after this time to read any posts. If the posting rhythm is interrupted, the disruption to the readers may lead to fewer visitors. As a result of this, the owner may wish to stop.
  • Thrill is gone – Some blogs are started for the novelty value and die once this wears off.
  • Change of interest – Perhaps the owner becomes interested in another topic which does not relate to that of his blog. This change of interest may lead to the abandonment of the current blog.
  • Unpleasant comments – As illustrated in the presentation, some bloggers get abused and spammed via their blogs and may wish to stop as a result.
  • Other reasons – Priorities changing, people moving to other mediums, the loss of a username and password etc. This blog will probably die quite quickly after the presentations are over.

November 23, 2006

Tracking Information Epidemics In Blogspace

Today’s presentation concerned the tracking of specific data through blogspace using the analogy of infections. The idea of ‘memes’ was introduced, which are essentially units of information passed from one mind to another. An ‘infected’ blog is therefore one that contains a specific meme, such as a URL. The example given in the paper involves an attempt to locate the source of the URL,
There are certain problems associated with attempting to track URLs, or memes, through blogspace. Firstly, not every infected blog may have seen the source and there is difficulty in finding where a URL may have originated for any particular blog.
One helpful blog attribute to aid meme tracking is the blogroll. This is simply a collection of links to other weblogs usually presented on the front sidebar of a blog, seen to the left of these entries. Another possibility is that the information is given an associated link to the source, known as a via link. This is very rare however.
Blog users can specify ‘trackbacks’ when making a blog entry. A trackback is a link to the blog that is referred to by this entry, i.e. it is a link to the local source of the information. Trackbacks can create bidirectional links between blogs. That is, if I create a trackback link in this entry referring to Blog B, then a link will automatically be produced at Blog B to this blog.
Unfortunately, links are not always easy to infer using these methods and so classifiers may be used.

The presentation contained descriptions of methods for inferring infection routes. This summary will not include the equations given therein. The first of these methods looked at the similarity between URLs published on blogs. This similarity function used the numbers of URLs on blogs A and B, and the number of shared URLs between A and B. The similarity was computed as the number of shared URLs divided by the product of the square root of the number of URLs on blog A and the square root of the number on blog B. It was shown that linked blogs commonly share a number of URLs.
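To give a feel for the URL similarity measure, here is a small sketch. I am reading the denominator as the product of the two square roots, which is my interpretation of what was presented rather than a quote from the paper, and the function name and the handling of empty blogs are my own choices.

```python
from math import sqrt


def url_similarity(urls_a, urls_b):
    """Shared URLs divided by sqrt(URLs on A) * sqrt(URLs on B)."""
    a, b = set(urls_a), set(urls_b)
    if not a or not b:
        return 0.0   # assumed convention: a blog with no URLs is similar to nothing
    return len(a & b) / (sqrt(len(a)) * sqrt(len(b)))
```

Under this reading, two blogs with four URLs each and two in common score 0.5, and a blog compared with itself scores 1.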
Another method is to look at textual similarity between blogs. This is done by analysing common words with relation to the size of the text being analyzed. Obviously, if there are more common words in a smaller amount of text then the two are more similar.
Also, the time at which blog entries are made can be observed. Therefore connections can be made between timings and infection sources. If blog A consistently cites a URL before blog B, then it’s fair to assume that A was the source of infection for B.
These methods can be used to build classifiers to (supposedly) automatically detect infection routes. Two classifiers were mentioned: one which detects bidirectional links, unidirectional links and unlinked pairs; and another which just distinguishes between linked and unlinked pairs. The latter of these was deemed more accurate.

Also mentioned in this presentation were the visualisations of the infection routes via directed acyclic graphs. The paper displays some of these graphs showing the spread of information through a small area of blogspace.
The presentation concluded with a look at some of the problems associated with this work. The most significant of these seemed to be that classifiers can only examine a very small area of blogspace. Although the amount of information may seem large, the ‘blogosphere’ is extremely vast in comparison. The incompleteness of crawls can also be questioned. In addition to this, the robustness of the classifiers was said to be disputable.
However, the work was deemed to have produced some useful visualization tools and provided a unique way of using infection properties to analyze the spread of information. The authors also claim that the work could be used for search engines to locate information sources but this seems dubious.

Please comment on any inaccuracies or additions you wish to make because I’m sure this is quite incomplete.
