In earlier posts, we've talked about matrix methods for tweezing out semantics. Here's another quick idea that can be used for term sense disambiguation.
- Index all of your documents into Solr.
- Extract the term/document matrix.
- Multiply the matrix (with terms as rows) by its own transpose. This will give you a term co-occurrence matrix. So if you have N terms, the result will be an N-by-N matrix. Let's examine one row, the row for “tear”. The elements of this row correspond roughly to how commonly “tear” co-occurs with each of the other terms. So for the “tear” row, you'll get large scores for words like “paper” and “cry”. Although we know that you “cry tears” and that “paper tears”, and that “tear” has a different meaning in each case, there is nothing immediately visible in this row's data to indicate that “tear” has more than one sense. But…
- Find the terms with the strongest values in the “tear” row and extract the portion of the co-occurrence matrix associated with them. For instance, in the case of “tear” you will go to the rows for “cry”, “paper”, “apart”, and “sweat”, and you'll pull out only the columns for these same terms. In this case that gives you a 4-by-4 matrix.
- Perform clustering on the rows of this sub-co-occurrence matrix. For most words, which in practice have a single common sense, there will be just one obvious group. However, for words that have multiple senses, as is the case for “tear”, you'll find that this matrix decouples fairly well into subgroups, and several of its elements will be small. For instance, while “tear” co-occurs highly with “cry”, “paper”, “apart”, and “sweat”, you'll immediately see that “cry” does not commonly co-occur with “paper” or “apart”, while it does co-occur moderately with “sweat” (e.g. “blood, sweat, and tears”).
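To make the steps above concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the toy corpus stands in for documents you would pull out of Solr, the helper names (`neighbours`, `sense_groups`) are made up, and the "clustering" is just linking terms whose co-occurrence count reaches a cutoff. A real corpus would call for a sparse matrix library and a proper clustering algorithm.

```python
# Toy corpus standing in for documents indexed in Solr (invented for illustration).
docs = [
    "cry tear sweat",
    "tear cry",
    "blood sweat tear cry",
    "tear paper apart",
    "paper tear apart",
    "tear paper",
]

# Term/document matrix A: one row per term, one column per document (binary).
terms = sorted({w for d in docs for w in d.split()})
index = {t: i for i, t in enumerate(terms)}
A = [[0] * len(docs) for _ in terms]
for j, d in enumerate(docs):
    for w in set(d.split()):
        A[index[w]][j] = 1

# Co-occurrence matrix C = A * A^T, which is N-by-N for N terms.
N = len(terms)
C = [[sum(A[i][k] * A[j][k] for k in range(len(docs))) for j in range(N)]
     for i in range(N)]


def neighbours(term, top_k):
    """Terms that co-occur most strongly with `term`, excluding the term itself."""
    row = C[index[term]]
    ranked = sorted((t for t in terms if t != term), key=lambda t: -row[index[t]])
    return ranked[:top_k]


def sense_groups(term, top_k=4, cutoff=1):
    """Group the neighbours of `term`; two neighbours are linked when their
    co-occurrence count is at least `cutoff` (a stand-in for real clustering)."""
    near = neighbours(term, top_k)
    # The sub-co-occurrence matrix, restricted to the neighbours.
    sub = {(a, b): C[index[a]][index[b]] for a in near for b in near}
    groups = []
    for t in near:
        linked = [g for g in groups if any(sub[(t, u)] >= cutoff for u in g)]
        merged = {t}.union(*linked)
        groups = [g for g in groups if g not in linked] + [merged]
    return groups


for group in sense_groups("tear"):
    print(sorted(group))
```

On this toy data the neighbours of “tear” split into two groups, one holding “cry” and “sweat” and the other holding “paper” and “apart”, while `sense_groups("cry", top_k=2)` comes back as a single group.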
Now, the big fat challenge here is going to be defining a cutoff criterion for the point at which multiple senses exist.
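One way to get a feel for that challenge is to sweep the cutoff and watch how the number of groups changes. The sketch below is only an illustration, not a solution: the counts in `sub` are invented, the normalisation (dividing each count by the geometric mean of the two terms' own counts, i.e. the cosine similarity of their document vectors) is my assumption about how to make one cutoff comparable across frequent and rare terms, and `count_groups` is a hypothetical helper that counts connected components.

```python
from math import sqrt

# Hypothetical sub-co-occurrence counts for the neighbours of "tear".
# The diagonal holds each term's own document count. All numbers are invented.
near = ["cry", "paper", "apart", "sweat"]
sub = [
    [40,  1,  0, 12],   # cry
    [ 1, 50, 20,  0],   # paper
    [ 0, 20, 30,  1],   # apart
    [12,  0,  1, 25],   # sweat
]


def normalised(sub):
    """Cosine-style scaling: count / sqrt(count_i * count_j)."""
    n = len(sub)
    return [[sub[i][j] / sqrt(sub[i][i] * sub[j][j]) for j in range(n)]
            for i in range(n)]


def count_groups(scores, cutoff):
    """Number of connected components when terms are linked by scores >= cutoff."""
    n = len(scores)
    seen, groups = set(), 0
    for start in range(n):
        if start in seen:
            continue
        groups += 1
        stack = [start]
        while stack:
            i = stack.pop()
            if i in seen:
                continue
            seen.add(i)
            stack.extend(j for j in range(n)
                         if j not in seen and scores[i][j] >= cutoff)
    return groups


scores = normalised(sub)
for cutoff in (0.01, 0.1, 0.3, 0.6):
    print(cutoff, count_groups(scores, cutoff))
```

With these made-up numbers, a very low cutoff lumps everything into one group, a very high one leaves every term on its own, and the two-sense answer only shows up in between. That is exactly why picking the cutoff is the hard part; looking for a range of cutoffs over which the group count stays stable is one heuristic you could try.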
And then, once you know about the senses… how do you use this information?