October 12, 2005

Latent Semantic Analysis

I am allergic to jargon! It can be used to

  • conceal the lack of substance or meaning in an idea

  • inflate the commonplace or trivial

  • exclude the uninitiated

  • confer pseudo-scientific authority and claim false superiority

So what about Latent Semantic Analysis and its inevitable contraction to LSA? I came across it in an interesting conversation with Mike Joy about current issues in e-learning and assessment. LSA is a statistical method designed to measure the commonality of meaning in a collection of text passages or documents. It compares the frequency of significant words, numerically conflates their meanings, applies some mathematical jiggery-pokery to the data (viewed as sparse matrices), and comes up with some numbers that may indicate how close the texts are in meaning. It can be used as an alternative to the more familiar comparison of strings (à la Google) in detecting likely plagiarism. I believe that it is used effectively in monitoring plagiarism in program source code submitted for assessment by students in the Department of Computer Science.
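For the curious, the jiggery-pokery can be sketched in a few lines. What follows is only a minimal illustration, assuming Python with NumPy and a four-document toy corpus invented for the purpose; the function names and the choice of k = 2 latent dimensions are likewise assumptions, and none of this is the tool used in the department. It counts words into a term-by-document matrix, takes a singular value decomposition, keeps the k largest singular values, and compares documents by the cosine of the angle between them in the reduced space.

```python
import numpy as np

def term_document_matrix(docs):
    """Rows are terms, columns are documents, entries are raw word counts."""
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    row = {word: i for i, word in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for word in doc.lower().split():
            counts[row[word], j] += 1
    return counts

def cosine_matrix(columns):
    """Cosine similarity between every pair of columns."""
    norms = np.linalg.norm(columns, axis=0, keepdims=True)
    unit = columns / np.where(norms == 0, 1.0, norms)  # avoid dividing by zero
    return unit.T @ unit

def lsa_similarity(docs, k=2):
    """Document-by-document similarity after reducing to k latent dimensions."""
    counts = term_document_matrix(docs)
    # Singular value decomposition: counts = U @ diag(s) @ Vt
    _, s, vt = np.linalg.svd(counts, full_matrices=False)
    # Keep the k largest singular values; each column is one document
    latent = s[:k, None] * vt[:k]
    return cosine_matrix(latent)

# Toy corpus (invented): documents 0 and 1 differ by one renamed word,
# as do documents 2 and 3; the two pairs share no vocabulary at all.
docs = [
    "loop adds number total",
    "loop adds value total",
    "essay cites author source",
    "essay quotes author source",
]

raw = cosine_matrix(term_document_matrix(docs))  # plain word-count comparison
lsa = lsa_similarity(docs, k=2)                  # comparison in latent space
print(np.round(raw, 2))
print(np.round(lsa, 2))
```

On this toy corpus the plain word-count cosine between the first two documents is 0.75, because one word in four has been renamed. With k = 2 the LSA similarity of the same pair comes out at essentially 1, while their similarity to the two essay documents stays at essentially 0. Real collections are far messier, and the choice of k and of any word weighting matters a good deal.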

Clearly jargon is both necessary and useful to experts, and LSA meets this test. It also has the virtue of meaning what it says: the analysis of hidden meaning. It might be interesting to run LSA on this and other blogs on plagiarism!

  1. I am carrying out research into ‘Latent Semantic Analysis (LSA) and source-code plagiarism’ under the supervision of Mike Joy.

    Very briefly, Latent Semantic Analysis (LSA) applies statistical techniques to capture the major relationships between terms (e.g. words) and contexts (e.g. documents) and to categorise them into a semantic structure depending on their similarity, hence “latent semantic” in the title of the method.

    Regarding plagiarism: in an attempt to disguise copied work, students may rename many words. LSA is suited to detecting such documents, whereas software based on string-matching algorithms may fail to identify them.

    LSA has been successfully applied to educational applications such as automatic essay scoring and natural-language plagiarism detection. For more information see link

    Also, LSA is commonly used in tasks such as search and retrieval, classification and filtering, and it would be very interesting to apply it to the plagiarism discussion blogs.

    On my website, I have some information about LSA and plenty of references to papers. link

    Georgina Cosma

    13 Oct 2005, 18:53

