{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### 一.基本的PageRank算法\n", "PageRank算法是用于网页权重计算的算法,它本质就是一个马尔可夫模型,每个网页$Page(i),i=1,2,..,n$就是一个状态,状态的转移概率由该页面超链接的其他网页均摊,而PageRank值即是马尔可夫模型的平稳态概率分布,平稳态概率分布的每个分量即是对应的每个网页的PageRank值,比如如下的网页A、B、C、D的链接关系: \n", "![avatar](./source/12_pagerank_demo1.png) \n", "\n", "可以很方便的写出它的状态转移矩阵(A,B,C,D分别对应0,1,2,3): \n", "\n", "$$\n", "P=\\begin{bmatrix}\n", "0 & 1/2 & 1 & 0 \\\\ \n", "1/3 & 0 & 0 & 1/2 \\\\ \n", "1/3 & 0 & 0 & 1/2\\\\ \n", "1/3 & 1/2 & 0 & 0\n", "\\end{bmatrix}\n", "$$ \n", "\n", "下面,我们可以直接利用其马尔可夫模型求解其PageRank值" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "os.chdir('../')\n", "from ml_models.pgm import SimpleMarkovModel\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "#定义初始概率\n", "init_prob=np.ones(shape=(4,1))/4.0\n", "#定义概率转移矩阵\n", "trans_matrix=np.asarray([\n", " [0.,0.5,1.0,0],\n", " [1.0/3,0,0,1.0/2],\n", " [1.0/3,0,0,1.0/2],\n", " [1.0/3,1.0/2,0,0],\n", "])" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0.33325195],\n", " [0.22224935],\n", " [0.22224935],\n", " [0.22224935]])" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#求解pagerank:即通过足够长的step后的平稳态概率分布\n", "smm=SimpleMarkovModel(status_num=4)\n", "smm.predict_prob_distribution(set_init_prob=init_prob,set_prob_trans_matrix=trans_matrix,time_steps=10)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0.33333325],\n", " [0.22222225],\n", " [0.22222225],\n", " [0.22222225]])" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#校验:设置其他的step校验一下\n", "smm.predict_prob_distribution(set_init_prob=init_prob,set_prob_trans_matrix=trans_matrix,time_steps=20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 二.带平滑项的PageRank算法\n", "但是,实际情况往往会比较复杂,比如某些网页只进不出,如下图网页C就只进不出,这样会导致马尔可夫链无法收敛到一个稳定态 \n", "\n", "![avatar](./source/12_pagerank_demo2.png) \n", "\n", "这时,它的概率转移矩阵为: \n", "\n", "$$\n", "P=\\begin{bmatrix}\n", "0 & 1/2 & 0 & 0 \\\\ \n", "1/3 & 0 & 0 & 1/2 \\\\ \n", "1/3 & 0 & 0 & 1/2\\\\ \n", "1/3 & 1/2 & 0 & 0\n", "\\end{bmatrix}\n", "$$ \n", "\n", "我们再求解一下它的PageRank值" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[2.55407417e-08],\n", " [3.72237693e-08],\n", " [3.72237693e-08],\n", " [3.72237693e-08]])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "trans_matrix=np.asarray([\n", " [0.,0.5,0,0],\n", " [1.0/3,0,0,1.0/2],\n", " [1.0/3,0,0,1.0/2],\n", " [1.0/3,1.0/2,0,0],\n", "])\n", "smm.predict_prob_distribution(set_init_prob=init_prob,set_prob_trans_matrix=trans_matrix,time_steps=50)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "几乎都逼近到0了,我们可以通过添加一个平滑项来解决这个问题,即: \n", "\n", "$$\n", "A=dP+\\frac{(1-d)}{n}E\n", "$$ \n", "\n", "这里,$0\\leq d \\leq 1$,$E$是元素均为1的方阵,$n$表示状态数,注意,这样对每一列求和后可能由某些列的概率概率值<1(比如我们上面问题中的第三列),所以对于每一步更新: \n", "\n", "$$\n", "\\pi_{t+1}=A\\pi_t\n", "$$ \n", "\n", "都需要对$\\pi_{t+1}$做一次归一化操作,一般取无穷范数作归一化,即$\\pi_{t+1}$中绝对值最大的分量: \n", "\n", "$$\n", "\\pi_{t+1}=\\frac{\\pi_{t+1}}{\\mid\\mid \\pi_{t+1}\\mid\\mid}\n", "$$ \n", "\n", "下面对PageRank算法做一个单独的实现: " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "PageRank算法实现,封装到ml_models.pgm\n", "\"\"\"\n", "\n", "class PageRank(object):\n", " def __init__(self, init_prob, trans_matrix):\n", " self.init_prob = init_prob\n", " self.trans_matrix = trans_matrix\n", "\n", " def get_page_rank_values(self, time_steps=None, set_init_prob=None, set_prob_trans_matrix=None):\n", " init_prob = self.init_prob if set_init_prob is None else set_init_prob\n", " trans_matrix = self.trans_matrix if set_prob_trans_matrix is None else set_prob_trans_matrix\n", " for _ in range(0, time_steps):\n", " init_prob = trans_matrix.dot(init_prob)\n", " init_prob = init_prob / np.max(np.abs(init_prob))\n", " return init_prob / np.sum(init_prob)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0.1960504],\n", " [0.2679832],\n", " [0.2679832],\n", " [0.2679832]])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "trans_matrix=0.85*np.asarray([\n", " [0.,0.5,0.0,0],\n", " [1.0/3,0,0,1.0/2],\n", " [1.0/3,0,0,1.0/2],\n", " [1.0/3,1.0/2,0,0],\n", "])+0.15*np.ones(shape=(4,4))/4.0\n", "pr=PageRank(init_prob=init_prob,trans_matrix=trans_matrix)\n", "pr.get_page_rank_values(10)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0.19605034],\n", " [0.26798322],\n", " [0.26798322],\n", " [0.26798322]])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#检验下其他step\n", "pr.get_page_rank_values(100)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }