
{Experiment} - Build a Graph database of article links and calculate PageRank, CheiRank and PageView weights
Open, Needs Triage, Public

Description

User Story: “As a Signals PM, I want to experiment with article PageRank scores, CheiRank scores, and PageView scores so that we can eventually add them to the Integrity Signals product roadmap.”

User Story: “As an SC PM, I want to cluster Wikipedia articles into sub-domains and label the clusters by Wikipedia Categories or other classification labels (ArticleTopic, Infobox Name, WD InstanceOf, etc.). The hope is that this experiment will lead to features for the Machine Readability product roadmap.”

Acceptance criteria

  1. Experiment with EN Wikipedia to generate PageRank, CheiRank and PageView scores for each article
  2. Experiment with ML clustering of articles by using the internal Wikilinks to build a Graph Database of connected articles
  3. Save the article Graph hierarchy to a Postgres database
  4. Save the PageRank, CheiRank and PageView scores to the same Postgres database

Any code examples here are just for guidance on sub-task complexity. The code is a starting point for discussion and may change completely once the work starts.

ToDo

  • 0. Database Schema Setup
    1. Articles: Contains all articles with columns pageURL and title.
    2. PageLinks: Contains all links between pages with columns pageURL (source page) and outboundLinks (target page).
    3. Categories: Mapping of articles to categories with columns pageURL, CategoryName.
    4. PageViews: Tracks views per page with columns pageURL, ViewCount.
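A minimal setup sketch to make the later SQL runnable (Python + psycopg2; the connection string and column types are placeholder assumptions for the POC and can change once data volumes are known):

import psycopg2

# Placeholder connection; adjust for the local experiment database
conn = psycopg2.connect("dbname=dbname user=postgres")
cur = conn.cursor()

# DDL for the four tables described above; types are assumptions for the POC
cur.execute("""
    CREATE TABLE IF NOT EXISTS Articles (
        pageURL VARCHAR PRIMARY KEY,
        title   VARCHAR
    );
    CREATE TABLE IF NOT EXISTS PageLinks (
        pageURL       VARCHAR,  -- source page
        outboundLinks VARCHAR   -- target page
    );
    CREATE TABLE IF NOT EXISTS Categories (
        pageURL      VARCHAR,
        CategoryName VARCHAR
    );
    CREATE TABLE IF NOT EXISTS PageViews (
        pageURL   VARCHAR PRIMARY KEY,
        ViewCount BIGINT
    );
""")
conn.commit()
cur.close()
conn.close()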
  • 1. Calculating CheiRank and PageRank (a subset of CheiRank)
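As an off-the-shelf cross-check for the SQL iteration below, both scores can also be computed with networkx once the LinkMatrix table from step 2 exists (a sketch; the connection string is a placeholder):

import networkx as nx
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; LinkMatrix is built in step 2 below
engine = create_engine("postgresql://user:password@localhost/dbname")
edges = pd.read_sql('SELECT source, target FROM LinkMatrix', engine)

# Directed graph of internal article links
G = nx.from_pandas_edgelist(edges, source='source', target='target',
                            create_using=nx.DiGraph)

# PageRank rewards pages with many (important) incoming links; CheiRank is the
# same computation on the reversed graph, rewarding pages with many outgoing links.
pagerank = nx.pagerank(G, alpha=0.85)
cheirank = nx.pagerank(G.reverse(copy=False), alpha=0.85)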
  • 2. Create a Link Matrix Table. Something like below, but look for ways to optimise CPU and RAM for large projects
CREATE TABLE LinkMatrix AS
SELECT DISTINCT
    pl.pageURL       AS source,
    pl.outboundLinks AS target
FROM
    PageLinks pl
JOIN
    Articles a ON a.pageURL = pl.outboundLinks;  -- keep only links whose target is a known article
  • 3. Calculate CheiRank. This script initializes CheiRank scores to 1 for each article and iteratively updates them based on the link matrix. The process continues until changes between iterations are minimal.
CREATE TABLE CheiRankScores (
    pageURL VARCHAR PRIMARY KEY,
    score NUMERIC
);

-- Initialize CheiRank scores
INSERT INTO CheiRankScores (pageURL, score)
SELECT pageURL, 1.0 FROM Articles;

-- Iteratively update CheiRank scores until the largest per-page change drops below the
-- convergence threshold (a full implementation would also normalise by link counts, as in PageRank).
DO $$
DECLARE
    v_diff NUMERIC := 1;
BEGIN
    CREATE TEMP TABLE NewScores (pageURL VARCHAR PRIMARY KEY, score NUMERIC);
    WHILE v_diff > 0.0001 LOOP
        TRUNCATE NewScores;
        INSERT INTO NewScores
        SELECT cs.pageURL,
               0.85 * COALESCE((SELECT SUM(c.score)
                                FROM CheiRankScores c
                                JOIN LinkMatrix l ON c.pageURL = l.target
                                WHERE l.source = cs.pageURL), 0) + 0.15
        FROM CheiRankScores cs;

        SELECT MAX(ABS(n.score - cs.score)) INTO v_diff
        FROM NewScores n JOIN CheiRankScores cs ON cs.pageURL = n.pageURL;

        UPDATE CheiRankScores cs SET score = n.score
        FROM NewScores n WHERE n.pageURL = cs.pageURL;
    END LOOP;
    DROP TABLE NewScores;
END $$;
  • 4. Integrate PageView Data
-- Update the CheiRank iteration to factor in PageViews so that links from frequently-read
-- pages carry more weight. ViewCount should be normalised (e.g. share of total views)
-- so that the iteration can still converge.
DO $$
DECLARE
    v_diff NUMERIC := 1;
BEGIN
    CREATE TEMP TABLE NewScores (pageURL VARCHAR PRIMARY KEY, score NUMERIC);
    WHILE v_diff > 0.0001 LOOP
        TRUNCATE NewScores;
        INSERT INTO NewScores
        SELECT cs.pageURL,
               0.85 * COALESCE((SELECT SUM(c.score * pv.ViewCount)
                                FROM CheiRankScores c
                                JOIN LinkMatrix l ON c.pageURL = l.target
                                JOIN PageViews pv ON c.pageURL = pv.pageURL
                                WHERE l.source = cs.pageURL), 0) + 0.15
        FROM CheiRankScores cs;

        SELECT MAX(ABS(n.score - cs.score)) INTO v_diff
        FROM NewScores n JOIN CheiRankScores cs ON cs.pageURL = n.pageURL;

        UPDATE CheiRankScores cs SET score = n.score
        FROM NewScores n WHERE n.pageURL = cs.pageURL;
    END LOOP;
    DROP TABLE NewScores;
END $$;
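A quick sanity check of the stored scores once either iteration has run (a sketch; the connection string is a placeholder, and well-known articles should land near the top):

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string for the local experiment database
engine = create_engine("postgresql://user:password@localhost/dbname")

# Top 20 articles by CheiRank, as a quick check that the iteration converged to something plausible
top = pd.read_sql(
    'SELECT pageURL, score FROM CheiRankScores ORDER BY score DESC LIMIT 20',
    engine,
)
print(top)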
  • 5. Dimensionality Reduction. To visualize the data in 3D, use dimensionality reduction techniques such as LDA, PCA or t-SNE. This process is typically done outside of SQL, in Python:

5.1) PCA Exp 1

import pandas as pd
from sqlalchemy import create_engine
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Load a per-article feature matrix from PostgreSQL. CheiRank score and view count
# are just a starting point; the article-by-article link matrix (or richer features)
# would be the real high-dimensional input.
engine = create_engine("postgresql://user:password@localhost/dbname")
query = """
    SELECT cs.pageURL AS "pageURL", cs.score, pv.ViewCount AS "ViewCount"
    FROM CheiRankScores cs
    JOIN PageViews pv ON pv.pageURL = cs.pageURL
"""
data = pd.read_sql(query, engine)
features = data[['score', 'ViewCount']]

# Use PCA to reduce dimensions (PCA cannot produce more components than input features)
pca = PCA(n_components=min(3, features.shape[1]))
reduced_data = pca.fit_transform(features)

# Optionally use t-SNE for better clustering visualization
tsne = TSNE(n_components=3, verbose=1, perplexity=40, n_iter=300)
tsne_results = tsne.fit_transform(features)

# Plot the 3-D t-SNE embedding
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(tsne_results[:, 0], tsne_results[:, 1], tsne_results[:, 2])
plt.show()

Different methods will reveal different patterns and connections when mapping from higher to lower dimensions. Given enough time we should experiment with a few of the following techniques and see whether useful semantics can be drawn from the dimensionality reduction.

5.2) PCA Exp 2

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# 'features' is the numeric feature matrix loaded from the database in 5.1
pca = PCA(n_components=min(3, features.shape[1]))
pca_results = pca.fit_transform(features)

# Plotting
plt.figure(figsize=(8, 6))
plt.scatter(pca_results[:, 0], pca_results[:, 1])
plt.title('PCA Results')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()

5.3) UMAP

import umap

umap_results = umap.UMAP(n_neighbors=15, n_components=3, min_dist=0.1, metric='euclidean').fit_transform(features)

plt.figure(figsize=(8, 6))
plt.scatter(umap_results[:, 0], umap_results[:, 1])
plt.title('UMAP Results')
plt.xlabel('UMAP1')
plt.ylabel('UMAP2')
plt.show()

5.4) MDS

from sklearn.manifold import MDS

mds = MDS(n_components=3, metric=True)
mds_results = mds.fit_transform(features)

plt.figure(figsize=(8, 6))
plt.scatter(mds_results[:, 0], mds_results[:, 1])
plt.title('MDS Results')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.show()

5.5) ISOMAP

from sklearn.manifold import Isomap

isomap = Isomap(n_components=3, n_neighbors=5)
isomap_results = isomap.fit_transform(features)

plt.figure(figsize=(8, 6))
plt.scatter(isomap_results[:, 0], isomap_results[:, 1])
plt.title('Isomap Results')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.show()
  • 6. Cluster Labeling. Cluster the articles based on their t-SNE coordinates. You can use a clustering algorithm like K-means or DBSCAN. Here's an example using K-means:
from sklearn.cluster import KMeans

# Cluster the articles on the t-SNE embedding from step 5
kmeans = KMeans(n_clusters=10, random_state=42)
data['cluster'] = kmeans.fit_predict(tsne_results)

# Attach category labels (one row per article/category pair)
categories = pd.read_sql(
    'SELECT pageURL AS "pageURL", CategoryName AS "CategoryName" FROM Categories',
    engine)
labelled = data.merge(categories, on='pageURL')

# Identify the top 5 categories for each cluster, using the PageView data as weights
top_categories = labelled.groupby(['cluster', 'CategoryName'])['ViewCount'].sum().reset_index()
top_categories = top_categories.sort_values(['cluster', 'ViewCount'], ascending=[True, False])
top_categories = top_categories.groupby('cluster').head(5)

Another approach:

# Calculate the top categories for each cluster (unweighted counts this time)
def get_top_categories(cluster_id):
    cluster_categories = labelled.loc[labelled['cluster'] == cluster_id, 'CategoryName']
    return cluster_categories.value_counts().head(5).index.tolist()

# Map top categories to each cluster
cluster_labels = {cluster: get_top_categories(cluster) for cluster in data['cluster'].unique()}

# Now you can add the labels to the plot or use them for analysis
print(cluster_labels)
  • 7. Can we compress the Link Matrix with a Neural Network encoder/decoder, without losing too much information? Maybe this could be our API payload format for clients. Here we're using:

Variational Autoencoder (VAE)

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Lambda
from tensorflow.keras.models import Model
from tensorflow.keras.losses import binary_crossentropy
from tensorflow.keras import backend as K

# Reparameterisation trick: sample z from N(z_mean, exp(z_log_var))
def sampling(args):
    z_mean, z_log_var = args
    batch = K.shape(z_mean)[0]
    dim = K.int_shape(z_mean)[1]
    epsilon = K.random_normal(shape=(batch, dim))
    return z_mean + K.exp(0.5 * z_log_var) * epsilon

# link_matrix is assumed to be the dense 0/1 adjacency matrix built from the
# LinkMatrix table (one row per source article, one column per target article),
# e.g. pd.crosstab(edges['source'], edges['target']).clip(upper=1).to_numpy()
input_dim = link_matrix.shape[1]  # number of features
encoding_dim = 3  # number of dimensions in the bottleneck

# Encoder
inputs = Input(shape=(input_dim,))
encoded = Dense(128, activation='relu')(inputs)
z_mean = Dense(encoding_dim)(encoded)
z_log_var = Dense(encoding_dim)(encoded)
z = Lambda(sampling, output_shape=(encoding_dim,))([z_mean, z_log_var])

# Decoder layers kept as named objects so they can be reused in a standalone decoder
decoder_hidden = Dense(128, activation='relu')
decoder_output = Dense(input_dim, activation='sigmoid')
decoded = decoder_output(decoder_hidden(z))

# Encoder and Decoder models
encoder = Model(inputs, [z_mean, z_log_var, z])
decoder_input = Input(shape=(encoding_dim,))
decoder = Model(decoder_input, decoder_output(decoder_hidden(decoder_input)))

# VAE model
vae = Model(inputs, decoded)

# Loss function: reconstruction term plus KL divergence
reconstruction_loss = binary_crossentropy(inputs, decoded)
reconstruction_loss *= input_dim
kl_loss = 1 + z_log_var - K.square(z_mean) - K.exp(z_log_var)
kl_loss = K.sum(kl_loss, axis=-1)
kl_loss *= -0.5
vae_loss = K.mean(reconstruction_loss + kl_loss)

vae.add_loss(vae_loss)
vae.compile(optimizer='adam')

# Train the VAE on the link matrix (no separate targets; the loss is defined above)
vae.fit(link_matrix, epochs=50, batch_size=32)
  • 8. Visualize the article clusters in 3D using a plotting library like Matplotlib or Plotly, and label each cluster with its top 5 categories.
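A possible starting point for this, reusing the tsne_results, data['cluster'] and cluster_labels objects from steps 5 and 6 above (Plotly would give an interactive version of the same plot):

import matplotlib.pyplot as plt

# 3-D scatter of the t-SNE embedding coloured by cluster, with each cluster
# annotated at its centroid using the top-5 category labels from step 6
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(tsne_results[:, 0], tsne_results[:, 1], tsne_results[:, 2],
           c=data['cluster'], cmap='tab10', s=5, alpha=0.6)

for cluster_id, labels in cluster_labels.items():
    centroid = tsne_results[(data['cluster'] == cluster_id).to_numpy()].mean(axis=0)
    ax.text(centroid[0], centroid[1], centroid[2], ', '.join(labels), fontsize=7)

ax.set_title('Article clusters with top-5 category labels')
plt.show()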
Test Strategy

This is a POC experiment; no testing is planned at this stage.

Things to consider:
  • As a POC this will run locally on a desktop. When deciding on a tech solution, consider scaling this for all WME projects and running daily. We'll need to keep costs down.
Description (optional)
  • Build a Wikilinks database using the internal project links in HTML pages. Then build a weighting system that mimics Google's PageRank. Since CheiRank is a superset of PageRank, we'll build the CheiRank scores. Combine the weights with PageView scores to improve the graph network representation of how users read Wikipedia.
  • Since we have the Link matrix, we can investigate mapping this high-dimensional data into something that is easier to visualise. In doing so we can create a classification system by using our WMF categories (and/or Infobox Names or WD instanceOf labels)

Event Timeline

Possible other research to include in the evaluation of Link Scoring:

Some of these systems focus on general Internet pages rather than "quality" Wikipedia pages, but we may still learn from them how to identify pages that have been "gamed" to improve their visibility, e.g. editors packing the category list with too many categories, or adding more text/keywords than the topic needs. We should compare page length against page views and check for correlation, and conversely see whether any short pages have high PVs. Can we correlate edit count with PV activity? Can we identify, downgrade or upgrade "edit wars"? Are some editors producing higher PVs than others? Do page reverts impact PVs or correlate with other page metrics, and can we reverse that correlation to find other revertRisk measures? A rough starting point for these correlation checks is sketched after the links below.

https://en.wikipedia.org/wiki/TrustRank
https://en.wikipedia.org/wiki/Spam_mass
https://en.wikipedia.org/wiki/Hilltop_algorithm
https://en.wikipedia.org/wiki/Spamdexing
https://en.wikipedia.org/wiki/Adversarial_information_retrieval
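
A rough starting point for the correlation questions above (a sketch; the page_metrics table and its columns are hypothetical and would need to be assembled from dumps or APIs first):

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical per-article metrics table; none of these columns exist in the
# schema above yet and would have to be collected first.
engine = create_engine("postgresql://user:password@localhost/dbname")
metrics = pd.read_sql(
    'SELECT page_length, edit_count, revert_count, view_count FROM page_metrics',
    engine,
)

# Pairwise Spearman correlations (rank-based, so heavy-tailed view counts are fine)
print(metrics.corr(method='spearman'))

# Short pages with unusually high page views
short_high_pv = metrics[
    (metrics['page_length'] < metrics['page_length'].quantile(0.10))
    & (metrics['view_count'] > metrics['view_count'].quantile(0.90))
]
print(len(short_high_pv), "short pages with high PVs")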