M.Sc Thesis

M.Sc StudentMashiach Li-Tal
SubjectLocal Approximation of PageRank and Reverse PageRank
DepartmentDepartment of Computer Science
Supervisor MR Ziv Bar-Yossef
Full Thesis textFull thesis text - English Version


Local PageRank approximation: The vast majority of algorithms for computing PageRank have focused on global computation of the PageRank vector. While in many applications of PageRank a global computation is needed, there are situations in which one is interested in computing PageRank scores for just a small subset of the nodes.

Lower bounds: We prove that local approximation of PageRank, even to within modest approximation factors, is infeasible in the worst-case, as it requires probing the link server for Omega(n) nodes, where n is the size of the graph. The difficulty emanates from nodes of high in-degree and/or from slow convergence of the PageRank random walk.

Sufficiency: We observe that the algorithm of Chen, Gan, and Suel (CIKM 2004) works well for graphs that have bounded in-degree and admit fast PageRank convergence: if the PageRank random walk converges on the graph in r steps and if the maximum in-degree of the graph is d, then the algorithm of Chen et al. crawls a subgraph of size at most d^r and thus requires at most this number of queries to the link server. When d and r are small, the algorithm is efficient. This demonstrates that the two conditions we showed to be necessary for fast local PageRank approximation are also sufficient.

PageRank vs. Reverse PageRank: By analyzing the stanford.edu crawl, we show that the reverse web graph, like the web graph, admits quick PageRank convergence (on 80% nodes of the reverse graph, PageRank converged within 20 iterations). We also show that the reverse graph has a bounded in-degree (only 255 as opposed to 38,606 in the regular graph). These findings hint that local PageRank approximation should be feasible on the reverse graph.

To put this hypothesis to test, we measured the performance of the algorithm of

Chen et al. on a sample of nodes from the stanford.edu graph. We show that for highly ranked nodes the performance of the algorithm on the reverse graph is up to three times better than on the regular graph.

Applications of Reverse PageRank: We observe that RPR has a multitude of applications. It has been used before to select good seeds for the TrustRank measure, to detect highly influential nodes in social networks, and to find hubs in the web graph. We present two additional novel applications of RPR: (1) finding good seeds for crawling; and (2) measuring the “semantic relatedness” of concepts in a taxonomy.