The future of the semantic web is LSI
May 13th, 2003
LSI stands for Latent Semantic Indexing. I found it in this interesting story (via Jon Udell) and dove into it last night. I downloaded all the papers I could find and went through the math (it seems complex at first, but it isn't: it's just a bunch of high-dimensional linear algebra).
LSI emerged as a way to improve searching of large quantities of unstructured text; then Google's hyperlink-topology-based approach proved to be vastly superior (and much less computationally demanding) to text-only approaches and silenced most attempts to improve textual search. Now LSI has resurfaced in the battle against spam. You can bet that pretty soon everybody will be talking about LSI vs. Bayesian. And, in fact, LSI has the potential to kill Bayesian approaches, because it can match documents where the word isn't even present!
LSI works simply by describing documents in a syntactic space, then rotating that space into a semantic one. It's not so different from good old Fourier, Laplace, or even Haar and Daubechies transforms (or JPEG/MPEG compression, for that matter): you rotate a high-dimensional space until you find a set of orthogonal eigenvectors capable of describing the same solution space, hopefully with lower dimensionality.
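Concretely, that rotation is a truncated singular value decomposition (SVD) of a term-document matrix. Here's a minimal sketch in Python with NumPy, using a made-up toy corpus (all names and documents here are purely illustrative):

```python
import numpy as np

# Hypothetical toy corpus; in practice you'd have thousands of documents.
docs = [
    "human machine interface computer applications",
    "survey user opinion computer system response time",
    "user perceived response time error measurement",
    "generation random binary trees",
    "graph minors survey",
]

# Term-document count matrix A: rows are words (the syntactic axes),
# columns are documents.
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[index[w], j] += 1

# The SVD A = U @ diag(s) @ Vt rotates the raw term axes into orthogonal
# "concept" axes. Keeping only the k largest singular values gives a
# lower-dimensional semantic space.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Each document's coordinates in the k-dimensional semantic space:
doc_vecs = (np.diag(s_k) @ Vt_k).T   # shape (n_docs, k)
```

Real systems typically weight the counts (e.g. tf-idf) before the SVD, but the rotation itself is exactly this.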
Querying is just a matter of measuring, in the rotated space, the distance between the query and every document in the corpus.
Ranking is simply sorting by that distance.
The results are impressive: you are no longer searching for a token (a syntactic axis) but over the orthogonalization of those axes across a text corpus. That is, you are searching against its semantics, as extracted from the way tokens relate to one another in the various documents.
This is why LSI can come up with a significant result even if the word you are searching for isn't in the document at all! You are matching against the transparent, all-text-permeating relations between words, not the words themselves.
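A toy illustration of exactly that effect (hypothetical three-document corpus): the query word "car" never appears in the second document, yet it scores far above the unrelated one, because "car" and "automobile" occur in the same context:

```python
import numpy as np

# Hypothetical corpus: "car" and "automobile" never co-occur, but they
# share the context "engine wheels".
docs = [
    "car engine wheels",         # contains "car"
    "automobile engine wheels",  # no "car", same context
    "cooking recipe salt",       # unrelated
]
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[index[w], j] += 1

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_vecs = (np.diag(s[:k]) @ Vt[:k, :]).T

# Fold the one-word query "car" into the semantic space.
q = np.zeros(len(vocab))
q[index["car"]] = 1.0
qv = q @ U[:, :k] / s[:k]

sims = doc_vecs @ qv / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(qv))
# sims[1] (the "automobile" document, which never mentions "car") comes out
# near 1, while sims[2] (the cooking document) comes out near 0.
```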
LSI has proven effective at distinguishing synonyms by extracting contextual information automatically! It is also capable of passing the TOEFL test with a 65% score (the average student score), and its semantic discernment has been compared to that of an 8-year-old child.
Forget RDF, topic maps, and all those imposed semantic catalogs: LSI is the future of the semantic web. And, potentially, if combined with markup-extracted information, a Google killer.
This is a happy day for the web.