Welcome! I hope that you enjoyed my presentation of the papers by Richardson and Domingos and by Menczer. On this page, I've included my presentation and a brief outline of it, some thoughts on the class discussion afterward, and some resources that you may find useful if you're interested in further exploring Web mining.
The time for discussion and comments doesn't have to be over! Please, if you have any remaining questions or thoughts on these papers, share them with the rest of us on the class mailing list!
You can find my slides in PowerPoint format right here.
A lot of our class discussion centered on speculation about whether subsets of the Web devoted to particular topics generally share a similar structure. I think that if we examine the subset of the Web that contains only pages with a particular term, then random graph theory implies that we should still have a single large component. However, it is not at all clear to me that the structure of this component will be such that running PageRank on it is quite as meaningful as it is on the full Web.
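If you want to poke at this question yourself, here is a minimal Python sketch of the experiment: keep only the pages that contain a term, find the largest weakly connected component of what is left, and run a plain PageRank on it. The tiny `links` and `page_terms` data, the function names, and the damping factor of 0.85 are all my own assumptions for illustration; none of this comes from either paper.

```python
from collections import deque

def term_subgraph(links, page_terms, term):
    """Keep only the pages containing `term` and the links between them."""
    keep = {p for p, terms in page_terms.items() if term in terms}
    return {p: [q for q in links.get(p, []) if q in keep] for p in keep}

def largest_weak_component(graph):
    """Return the largest weakly connected component (link direction ignored)."""
    undirected = {p: set() for p in graph}
    for p, outs in graph.items():
        for q in outs:
            undirected[p].add(q)
            undirected[q].add(p)
    seen, best = set(), set()
    for start in undirected:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        while queue:  # breadth-first search from `start`
            node = queue.popleft()
            for nb in undirected[node]:
                if nb not in comp:
                    comp.add(nb)
                    queue.append(nb)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return best

def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank; dangling pages spread rank uniformly."""
    n = len(graph)
    rank = {p: 1.0 / n for p in graph}
    for _ in range(iterations):
        dangling = sum(rank[p] for p, outs in graph.items() if not outs)
        new = {p: (1 - damping) / n + damping * dangling / n for p in graph}
        for p, outs in graph.items():
            for q in outs:
                new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank

# Hypothetical toy Web: page -> outgoing links, page -> terms on the page.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["e"], "e": [], "f": ["a"]}
page_terms = {
    "a": {"blue", "book"}, "b": {"blue"}, "c": {"blue", "car"},
    "d": {"blue"}, "e": {"blue"}, "f": {"book"},
}

sub = term_subgraph(links, page_terms, "blue")
component = largest_weak_component(sub)
core = {p: [q for q in sub[p] if q in component] for p in component}
ranks = pagerank(core)
print(sorted(component), ranks)
```

On a real crawl, the interesting numbers would be the fraction of the term's pages that land in `component` and how the ranking on `core` compares with the ranking of the same pages on the full graph.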
Heather and I discussed the query-dependent PageRank idea quite a bit with each other after class, and we reached the conclusion that the authors greatly underestimate the importance of combining words in multiple-word queries. Almost every useful query does contain multiple words, and those words usually seem to have a close association with each other.
What happens if you search for "Kelley Blue Book" in the QD-PageRank system? Can we expect the "blue" Web and the "book" Web to have any meaningful relationship with one another? Would the "blue" Web have any coherent structure? Are there pages about the world of blue?
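To make the worry concrete, here is a rough Python sketch of how I understand multi-word queries to be handled: compute a separate query-dependent PageRank for each term, then take a weighted sum. Please treat the details as assumptions on my part: the relevance function is a raw term count, the term weights are uniform, pages with no relevant out-links jump according to the term's relevance distribution, and the four-page Web is invented. It is a sketch, not the authors' implementation.

```python
def qd_pagerank(links, relevance, beta=0.85, iterations=50):
    """Single-term query-dependent PageRank (sketch).

    The surfer jumps to pages, and follows links, in proportion to each
    page's relevance to the term instead of uniformly.
    """
    pages = list(links)
    total = sum(relevance.get(p, 0.0) for p in pages)
    if total == 0:
        return {p: 0.0 for p in pages}  # term appears nowhere
    jump = {p: relevance.get(p, 0.0) / total for p in pages}
    rank = dict(jump)
    for _ in range(iterations):
        new = {p: (1 - beta) * jump[p] for p in pages}
        for p in pages:
            out_total = sum(relevance.get(q, 0.0) for q in links[p])
            if out_total == 0:
                # Assumption: with no relevant out-link, the surfer jumps.
                for q in pages:
                    new[q] += beta * rank[p] * jump[q]
            else:
                for q in links[p]:
                    new[q] += beta * rank[p] * relevance.get(q, 0.0) / out_total
        rank = new
    return rank

def combined_score(links, term_counts, query_terms, weights=None):
    """Weighted sum of per-term scores; uniform weights are an assumption."""
    weights = weights or {t: 1.0 / len(query_terms) for t in query_terms}
    scores = {p: 0.0 for p in links}
    for t in query_terms:
        relevance = {p: term_counts.get(p, {}).get(t, 0) for p in links}
        ranks = qd_pagerank(links, relevance)
        for p in links:
            scores[p] += weights[t] * ranks[p]
    return scores

# Hypothetical four-page Web and per-page term counts.
links = {"kbb": ["library"], "sky": ["paint", "kbb"],
         "library": ["kbb"], "paint": ["sky"]}
term_counts = {
    "kbb": {"kelley": 1, "blue": 1, "book": 1},
    "sky": {"blue": 1},
    "library": {"book": 1},
    "paint": {"blue": 1},
}

scores = combined_score(links, term_counts, ["kelley", "blue", "book"])
print(sorted(scores, key=scores.get, reverse=True))
```

Notice that each term's walk is computed in isolation. Nothing in the sum rewards a page for containing the words together, which is exactly the association between query words that Heather and I think matters.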
Here's some food for thought: Handling multiple query terms is hard enough in English. What about languages that build up complex thoughts through the addition of prefixes and suffixes? In such a language, the possible space of "words" is far beyond the capacity of any computer system to manage. The ability to search for parts of a word becomes vital. How can a crawler or search engine cope with this situation?
Here's a collection of resources and other things to look at that I feel might be useful if this topic stirs your interest:
Here is the Perl source code that I wrote as a simulation of Prof. Menczer's generative model. It produces Pajek files as output. Please feel free to mess with the code--and double-check my understanding of the model!
The presentations by Monika Henzinger and Steve Lawrence that Prof. Börner linked to on the course home page are well worth a look. In particular, Henzinger presents some very interesting results on random walks and the connections between top-level domains.
This is the textbook we used last year in Prof. Menczer's course on Web mining. There's a lot of useful content in the book itself, and the bibliography is particularly helpful.
This is the original PageRank paper, and it's quite accessible.
Prof. Menczer has quite a few papers related to topical Web crawlers, which you can find in the papers section of his home page.