User Story: “As a Signals PM, I want to experiment with article PageRank scores, CheiRank scores, and PageView scores so that we can eventually add them to Integrity Signals product roadmap.”
User Story: “As an SC PM, I want to cluster Wikipedia articles into sub-domains, I want to label clusters by Wikipedia Categories or other classification labels (ArticleTopic, Infobox Name, WD InstanceOf... etc). It's hoped this experiment will lead to features for Machine Readability product roadmap.”
Acceptance criteria
- Experiment with EN Wikipedia to generate PageRank, CheiRank and PageView scores for each article
- Experiment with ML cluster of articles by using the internal Wikilinks to build a Graph Database of connected articles
- Save the article Graph hierarchy to a Postgres database
- Save the PageRank, CheiRank and PageView scores to the same Postgres database
Any code examples here are just for guidance on sub-task complexity. The code is just a starting point for discussion and can completely change as the work is started.
ToDo
- 0. Database Schema Setup
- Articles: Contains all articles with columns pageURL and title.
- PageLinks: Contains all links between pages with columns pageURL (source page) and outboundLinks (target page).
- Categories: Mapping of articles to categories with columns pageURL, CategoryName.
- PageViews: Tracks views per page with columns pageURL, ViewCount.
- 1. Calculating CheiRank and PageRank (a subset of CheiRank)
- 2. Create a Link Matrix Table. Something like below, but look for ways to optimise CPU and RAM for large projects
CREATE TABLE LinkMatrix AS SELECT pl1.pageURL AS source, pl2.pageURL AS target FROM PageLinks pl1 JOIN PageLinks pl2 ON pl1.outboundLinks = pl2.pageURL;
- 3. Calculate CheiRank. This script initializes CheiRank scores to 1 for each article and iteratively updates them based on the link matrix. The process continues until changes between iterations are minimal.
CREATE TABLE CheiRankScores ( pageURL VARCHAR PRIMARY KEY, score NUMERIC ); -- Initialize CheiRank scores INSERT INTO CheiRankScores (pageURL, score) SELECT pageURL, 1.0 FROM Articles; -- Iteratively update CheiRank scores. Modify the CheiRank to incorporate PageView data, adjusting influence by page popularity: DO $$ DECLARE v_diff NUMERIC; BEGIN v_diff := 1; WHILE v_diff > 0.0001 LOOP UPDATE CheiRankScores cs SET score = 0.85 * (SELECT SUM(c.score) FROM CheiRankScores c JOIN LinkMatrix l ON c.pageURL = l.target WHERE l.source = cs.pageURL) + 0.15; GET DIAGNOSTICS v_diff = ROW_COUNT; END LOOP; END $$;
- 4. Integrate PageView Data
-- Update CheiRank score calculation to factor in PageViews DO $$ DECLARE v_diff NUMERIC; BEGIN v_diff := 1; WHILE v_diff > 0.0001 LOOP UPDATE CheiRankScores cs SET score = 0.85 * (SELECT SUM(c.score * pv.ViewCount) FROM CheiRankScores c JOIN LinkMatrix l ON c.pageURL = l.target JOIN PageViews pv ON c.pageURL = pv.pageURL WHERE l.source = cs.pageURL) + 0.15; GET DIAGNOSTICS v_diff = ROW_COUNT; END LOOP; END $$;
- 5. Dimensionality Reduction. To visualize the data in 3D, use dimensionality reduction techniques such as LDA, PCA or t-SNE. This process is typically done outside of SQL, in Python:
5.1) PCA Exp 1
import pandas as pd from sklearn.decomposition import PCA from sklearn.manifold import TSNE import matplotlib.pyplot as plt # Load data from PostgreSQL connection = "your_connection_string" query = "SELECT pageURL, score FROM CheiRankScores" data = pd.read_sql(query, connection) # Use PCA to reduce dimensions pca = PCA(n_components=3) reduced_data = pca.fit_transform(data[['score']]) # Optionally use t-SNE for better clustering visualization tsne = TSNE(n_components=3, verbose=1, perplexity=40, n_iter=300) tsne_results = tsne.fit_transform(data[['score']]) # Plot fig = plt.figure() ax = fig.add_subplot(111, projection='3d') ax.scatter(tsne_results[:,0], tsne_results[:,1], tsne_results[:,2]) plt.show()
Different methods will show us different patterns and connections when mapping from higher dimensions to lower dimensions. Given enough time we should experiment with a few of these following techniques and see if there is useful semantics to be drawn from the dimension simplification.
5.2) PCA Exp 2
from sklearn.decomposition import PCA import matplotlib.pyplot as plt # Assuming 'data' is your DataFrame loaded from the database pca = PCA(n_components=3) pca_results = pca.fit_transform(data) # Plotting plt.figure(figsize=(8, 6)) plt.scatter(pca_results[:, 0], pca_results[:, 1]) plt.title('PCA Results') plt.xlabel('PC1') plt.ylabel('PC2') plt.show()
5.3) UMAP
import umap umap_results = umap.UMAP(n_neighbors=15, n_components=3, min_dist=0.1, metric='euclidean').fit_transform(data) plt.figure(figsize=(8, 6)) plt.scatter(umap_results[:, 0], umap_results[:, 1]) plt.title('UMAP Results') plt.xlabel('UMAP1') plt.ylabel('UMAP2') plt.show()
5.4) MDS
from sklearn.manifold import MDS mds = MDS(n_components=3, metric=True) mds_results = mds.fit_transform(data) plt.figure(figsize=(8, 6)) plt.scatter(mds_results[:, 0], mds_results[:, 1]) plt.title('MDS Results') plt.xlabel('Dimension 1') plt.ylabel('Dimension 2') plt.show()
5.5) ISOMAP
from sklearn.manifold import Isomap isomap = Isomap(n_components=3, n_neighbors=5) isomap_results = isomap.fit_transform(data) plt.figure(figsize=(8, 6)) plt.scatter(isomap_results[:, 0], isomap_results[:, 1]) plt.title('Isomap Results') plt.xlabel('Dimension 1') plt.ylabel('Dimension 2') plt.show()
- 6. Cluster Labeling. Cluster the articles based on their t-SNE coordinates. You can use a clustering algorithm like K-means or DBSCAN. Here's an example using K-means:
from sklearn.cluster import KMeans # Perform clustering kmeans = KMeans(n_clusters=10, random_state=42) data['cluster'] = kmeans.fit_predict(embedding) #Identify the top 5 categories for each cluster using the PageView data as weights: top_categories = data.groupby(['cluster', 'CategoryName'])['ViewCount'].sum().reset_index() top_categories = top_categories.sort_values(['cluster', 'ViewCount'], ascending=[True, False]) top_categories = top_categories.groupby('cluster').head(5)
Another approach:
# Calculate top categories for each cluster def get_top_categories(cluster_id): cluster_pages = data[data['cluster'] == cluster_id]['pageURL'] categories = pd.read_sql("SELECT CategoryName FROM Categories WHERE pageURL IN {}".format(tuple(cluster_pages)), connection) weighted_counts = categories['CategoryName'].value_counts().head(5) return weighted_counts.index.tolist() # Map top categories to each cluster cluster_labels = {cluster: get_top_categories(cluster) for cluster in data['cluster'].unique()} # Now, you can add labels to your plot or use them for analysis print(cluster_labels)
- 7. Can we compress the Link Matrix with a Neural Network encoder/decoder, without losing too much information? Maybe this could be our API payload format for clients. Here we're using:
Variational Autoencoder (VAE)
import numpy as np import tensorflow as tf from tensorflow.keras.layers import Input, Dense, Lambda from tensorflow.keras.models import Model from tensorflow.keras.losses import binary_crossentropy from tensorflow.keras import backend as K def sampling(args): z_mean, z_log_var = args batch = K.shape(z_mean)[0] dim = K.int_shape(z_mean)[1] epsilon = K.random_normal(shape=(batch, dim)) return z_mean + K.exp(0.5 * z_log_var) * epsilon input_dim = data.shape[1] # number of features encoding_dim = 3 # number of dimensions in the bottleneck inputs = Input(shape=(input_dim,)) encoded = Dense(128, activation='relu')(inputs) z_mean = Dense(encoding_dim)(encoded) z_log_var = Dense(encoding_dim)(encoded) z = Lambda(sampling, output_shape=(encoding_dim,))([z_mean, z_log_var]) decoded = Dense(128, activation='relu')(z) decoded = Dense(input_dim, activation='sigmoid')(decoded) # Encoder and Decoder models encoder = Model(inputs, [z_mean, z_log_var, z]) decoder_input = Input(shape=(encoding_dim,)) decoder_layer = encoder.layers[-2](decoder_input) decoder_layer = encoder.layers[-1](decoder_layer) decoder = Model(decoder_input, decoder_layer) # VAE model vae = Model(inputs, decoder(encoder(inputs)[2])) # Loss function reconstruction_loss = binary_crossentropy(inputs, decoder(encoder(inputs)[2])) reconstruction_loss *= input_dim kl_loss = 1 + z_log_var - K.square(z_mean) - K.exp(z_log_var) kl_loss = K.sum(kl_loss, axis=-1) kl_loss *= -0.5 vae_loss = K.mean(reconstruction_loss + kl_loss) vae.add_loss(vae_loss) vae.compile(optimizer='adam') # Train VAE vae.fit(data, epochs=50, batch_size=32
- 8. Visualize the article clusters in 3D using a plotting library like Matplotlib or Plotly, and label each cluster with its top 5 categories.
Test Strategy
This is a POC experiment, no testing at this stage
Things to consider:
- As a POC this will run locally on a desktop. When deciding on a tech solution, consider scaling this for all WME projects and running daily. We'll need to keep costs down.
Description (optional)
- Build a Wikilinks database using the internal project links in HTML pages. Then build a weighting system that mimics Google's PageRank. Since CheiRank is a superset of PageRank, we'll build the CheiRank scores. Combine the weights with PageView scores to improve the graph network representation of how users read Wikipedia.
- Since we have the Link matrix, we can investigate mapping this high-dimensional data into something that is easier to visualise. In doing so we can create a classification system by using our WMF categories (and/or Infobox Names or WD instanceOf labels)