The paper on which this presentation was based sets out to capture hot topics of conversation by identifying the bloggers who play a particularly important role in those conversations. To do this, it defines two types of blog user: Agitators, who stimulate discussions, and Summarizers, who summarise them.
The paper discusses, in a fair amount of detail, the types of hyperlinks created between different blogs and between blogs and websites. An attempt is also made to formally define aspects of blogs in some way. For example, a blog site itself is defined in the document as follows:
site = (siteURL,RSS, blogger+, siteName, entry+)
This simply says that a blog site consists of a site URL, an RSS (Really Simple Syndication) feed, an associated set of bloggers who maintain the blog, a site name and a set of entries. RSS is a family of web feed formats used to distribute frequently updated web content. Programs called feed ‘aggregators’ can be used to subscribe to RSS feeds so that the user is informed when their content is updated.
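As a rough illustration, the site definition could be modelled as a small record type. This is only a sketch: the field names are my own, the Entry class is a hypothetical placeholder (the paper's full entry definition is not reproduced here), and reading the '+' as "one or more" follows the usual regular-expression convention rather than anything stated above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    # Hypothetical minimal entry; the paper's own entry definition has more to it.
    url: str
    author: str


@dataclass
class Site:
    site_url: str
    rss: str              # URL of the site's RSS feed
    bloggers: List[str]   # blogger+ : assumed to mean one or more bloggers
    site_name: str
    entries: List[Entry]  # entry+ : assumed to mean one or more entries

    def __post_init__(self) -> None:
        # Enforce the assumed "one or more" reading of '+'.
        if not self.bloggers:
            raise ValueError("a site needs at least one blogger")
        if not self.entries:
            raise ValueError("a site needs at least one entry")
```

A site would then be built with something like Site("http://example.org/blog", "http://example.org/blog/rss", ["alice"], "Example", [Entry("http://example.org/blog/1", "alice")]).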
More importantly to the content of the paper, the following definitions are given for types of links that can be included in a blog entry:
replyLink = (ei, ej), (ei → ej )
trackbackLink = (ei, ej), (ei → ej )
sourceLink = (ei, wi), (ei → wi)
Where ei, ej ∈ E , E is a set of blog entries, and wi ∈ W , W is a set of Web pages except blog entries.
Reply links and source links are added manually by the blogger and point to relevant blog entries and to non-blog web pages respectively. A trackback link, like a reply link, connects two blog entries, but it also includes a link (ej → ei) in the other direction. These were also described in earlier presentations.
The paper also introduces ‘blog threads’. A blog thread is simply a directed graph with a set of nodes representing blogs and websites and edges representing source links and reply links. The system described involves first extracting blog threads from blog data and then identifying important bloggers from the threads.
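To make the idea of a blog thread concrete, here is a minimal sketch that groups linked entries and web pages into threads. The paper's actual extraction procedure is not described here, so this assumes that a thread is simply a set of nodes connected by reply or source links when direction is ignored; the function name and link format are my own.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# (from_node, to_node, kind), where kind is "reply" or "source".
Link = Tuple[str, str, str]


def extract_threads(links: List[Link]) -> List[Set[str]]:
    """Group nodes into threads, assuming a thread is a weakly connected component."""
    # Undirected adjacency, used only to find which nodes belong together.
    neighbours: Dict[str, Set[str]] = defaultdict(set)
    for src, dst, _kind in links:
        neighbours[src].add(dst)
        neighbours[dst].add(src)

    threads: List[Set[str]] = []
    seen: Set[str] = set()
    for start in list(neighbours):
        if start in seen:
            continue
        # Depth-first walk collecting one connected component.
        component: Set[str] = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        threads.append(component)
    return threads
```

For example, the links e1 → e2, e3 → e2 and e3 → w1 would form one thread containing e1, e2, e3 and w1, while an unrelated link e4 → e5 would form a second thread.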
Initially, the system crawls through OPML files to examine different RSS feeds (blog sites). These feeds are added to an RSS list which is then crawled through by the system, extracting the site names, URLs and other relevant information related to the RSS feeds (shown in the formal definition above).
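The crawling step might look something like the sketch below, which uses only the Python standard library. It assumes OPML files that list feeds as outline elements with an xmlUrl attribute and feeds in RSS 2.0 form; the function names are hypothetical, and fetching the files over the network is left out.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def feed_urls_from_opml(opml_text: str) -> List[str]:
    """Collect the feed URLs listed in an OPML document."""
    root = ET.fromstring(opml_text)
    # Each subscribed feed is an <outline> element carrying an xmlUrl attribute;
    # outlines without one are just folders.
    return [o.attrib["xmlUrl"] for o in root.iter("outline") if "xmlUrl" in o.attrib]


def site_info_from_rss(rss_text: str) -> Tuple[str, str]:
    """Return (site name, site URL) from an RSS 2.0 feed."""
    channel = ET.fromstring(rss_text).find("channel")
    if channel is None:
        raise ValueError("not an RSS 2.0 document")
    return channel.findtext("title", default=""), channel.findtext("link", default="")
```

The list returned by feed_urls_from_opml plays the role of the RSS list, and site_info_from_rss pulls out the site name and URL used in the formal site definition.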
Next, the hyperlinks are extracted from the blog sites by examining the HTML. The HTML tag structure may be different for each blog server so this must be determined first. The hyperlinks are then split into source links and reply links to create blog threads.
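A simplified version of this step is sketched below. It collects every anchor href in an entry's HTML and treats a link as a reply link if it points to a known blog entry, and as a source link otherwise. The known_entries set and the function names are my own assumptions, and the sketch ignores the per-server tag structure that the real system has to work out first.

```python
from html.parser import HTMLParser
from typing import List, Set, Tuple


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a piece of HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def split_links(entry_html: str, known_entries: Set[str]) -> Tuple[List[str], List[str]]:
    """Return (reply_links, source_links) found in one entry's HTML."""
    collector = LinkCollector()
    collector.feed(entry_html)
    # Assumption: a link to a known blog entry is a reply link, anything else a source link.
    reply = [h for h in collector.hrefs if h in known_entries]
    source = [h for h in collector.hrefs if h not in known_entries]
    return reply, source
```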
Following this, the system must detect important bloggers from blog threads. If the system judges that threads generally grow after a particular blogger creates an entry, then that blogger is deemed to be an Agitator. If the system judges that a blogger creates entries that generally refer to other entries, then that blogger is deemed to be a Summarizer.
The paper goes on to describe three ways of testing a blog entry ex to decide whether the associated blogger should be characterized as an Agitator.
The first of these is a link-based discriminant. If kx is the number of entries in threadi with a reply link to ex, then ex is an entry by an Agitator if: (kx) > θ1
The second method is a popularity-based discriminant which identifies ex as an Agitator entry if the number of entries published in threadi increases a certain amount after the publication of ex. This is determined using the following discriminant:
(lx/mx) > θ2
Where lx is the number of entries in threadi published within a certain time frame after the publication of ex and mx is the number published beforehand.
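As a sketch of how this check could be coded: the version below counts entries in a fixed window after the publication of ex for lx and, as an assumption of mine, in a window of the same length before it for mx. To avoid dividing by zero when nothing was published beforehand, it tests lx > θ2 · mx, which is the same condition whenever mx is greater than zero. The function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List


def is_agitator_by_popularity(
    published: datetime,
    thread_times: List[datetime],
    window: timedelta,
    theta2: float,
) -> bool:
    """Popularity-based check for one entry published at `published`."""
    # l_x: entries in the thread published within `window` after e_x.
    l_x = sum(1 for t in thread_times if published < t <= published + window)
    # m_x: entries published within `window` before e_x (window length is an assumption).
    m_x = sum(1 for t in thread_times if published - window <= t < published)
    # Same as l_x / m_x > theta2 when m_x > 0, without the division.
    return l_x > theta2 * m_x
```

With one entry in the five days before ex and three in the five days after, the ratio is 3, so the entry passes for θ2 = 2 but not for θ2 = 3.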
The final method for identifying Agitators is a topic-based method. This method looks for a change in topic by calculating a feature vector for each blog entry from the term frequency values of its description. Similarity between feature vectors therefore indicates similarity in the text of two entry descriptions. The final discriminant uses characters that I can’t reproduce in this blog entry, but it can be found in the original text on the module website.
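The discriminant itself is not reproduced here, but the building blocks can be illustrated. The sketch below builds term frequency vectors from entry descriptions and compares them with cosine similarity. Both the whitespace tokenisation and the choice of cosine similarity are my assumptions rather than details taken from the paper.

```python
import math
from collections import Counter


def term_frequencies(text: str) -> Counter:
    """Term frequency vector for a description, using naive whitespace tokens."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term frequency vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

Two descriptions using the same words in the same proportions score close to 1, and descriptions with no words in common score 0, so a drop in similarity between successive entries would suggest a change of topic.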
Only a link-based discriminant is given for identifying Summarizers. If px is the number of entries in threadi that have a reply link from ex then ex is a Summarizer entry when (px) > θ4.
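Both link-based discriminants (kx for Agitators and px for Summarizers) come down to counting reply links into or out of an entry within a thread. A minimal sketch follows, assuming the thread's reply links are available as (ei, ej) pairs meaning ei → ej; the function names are my own.

```python
from typing import List, Tuple

ReplyLink = Tuple[str, str]  # (e_i, e_j) means e_i -> e_j


def k_x(entry: str, reply_links: List[ReplyLink]) -> int:
    """Number of distinct entries in the thread with a reply link to `entry`."""
    return len({src for src, dst in reply_links if dst == entry})


def p_x(entry: str, reply_links: List[ReplyLink]) -> int:
    """Number of distinct entries in the thread that `entry` has a reply link to."""
    return len({dst for src, dst in reply_links if src == entry})


def is_agitator_entry(entry: str, reply_links: List[ReplyLink], theta1: int) -> bool:
    return k_x(entry, reply_links) > theta1


def is_summarizer_entry(entry: str, reply_links: List[ReplyLink], theta4: int) -> bool:
    return p_x(entry, reply_links) > theta4
```

Suitable values for θ1 and θ4 are not given in this summary, so any thresholds used with these functions would need to be chosen by experiment.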
It seems that the authors of the paper had some interesting ideas for discovering important topics in blogspace and such information could be useful for certain purposes. For example, media companies could use these methods to find what topics are most important to people at a given time and choose to cover these topics in more detail. The methods could also be useful for blog search engines, to identify topics of interest.
However, the presentation group noted that the paper covers a small sample base, and the methods may be impractical for very large blogspaces. Also, the context of blog entries is not considered, nor are the comments associated with these entries. Even so, the ideas here could be developed to be very effective for certain applications.