
Analyze Dumps Usage Through Apache Logs
Closed, ResolvedPublic

Description

Background

We need better visibility into how different users consume Wikimedia dumps to inform our API strategy and improve the developer experience. This task proposes analyzing Apache logs to understand dumps usage patterns.

Scope

Process Apache log data (referenced in T119070: Track the number of wikidata dumps that are downloaded by type) to gather usage metrics. Estimated timeline: 1 week for 2 engineers.

Expected Outcomes

We aim to answer these key questions:

  1. What is user preference for full dumps vs incremental dumps?
  2. What is the utilization across dump types? (e.g. XML, old HTML, Enterprise HTML, backups, analytics)
  3. Are users typically downloading the latest version, or is there a preference for older versions?
  4. What is the utilization across Wikimedia projects? (e.g. language-specific utilization; Wikipedia vs Wiktionary)
  5. How are dumps being downloaded? (e.g. bot/automated process vs human)
  6. What are the most popular mirrors to click through to?

Additionally we would like to learn about:

  • User Demographics
    • Identify proportion of human vs bot downloaders
    • Distinguish between corporate and volunteer community users
  • Usage Patterns
    • Analyze preferences for monthly full dumps vs frequent updates
    • Determine which dump types are most valuable to different user groups

Known Limitations

Risk: Log data may not provide sufficient resolution to fully answer all questions about usage patterns and user segmentation.

Related Work

Dependencies

This work supports the Q3 request for the Research & Decision Science teams to analyze content access patterns as part of WE 5.5: API Strategy.

By the end of January, we will be able to measure and monitor Wikimedia-hosted dumps traffic using log data, which will provide clarity on how users are consuming the different dumps options and access points. This will, in turn, improve our understanding of what users care about in terms of recency, data completeness, and structure, so that we can tailor the overall API strategy accordingly.

Next Steps

  1. Implement log processing pipeline (see the parsing sketch after this list)
  2. Create visualization dashboard (in Superset?)
  3. Document findings and limitations
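As a rough sketch of step 1, the raw access log lines could be parsed into queryable columns with Hive's regexp_extract. This is illustrative only: the table name raw_access_logs, the single line column, and the assumption of Apache combined log format are all hypothetical.

-- Parse combined-format log lines into columns (sketch; names assumed)
SELECT
    regexp_extract(line, '^(\\S+)', 1)         AS ip,          -- client IP at line start
    regexp_extract(line, '\\[([^\\]]+)\\]', 1) AS request_ts,  -- timestamp between [ ]
    regexp_extract(line, '"[A-Z]+ (\\S+)', 1)  AS uri_path,    -- path from the request line
    regexp_extract(line, '" (\\d{3}) ', 1)     AS http_status, -- status code after request line
    regexp_extract(line, '"([^"]*)"\\s*$', 1)  AS user_agent   -- last quoted field
FROM raw_access_logs;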

Deadline

January 31, 2025

Details

Other Assignee
Ahoelzl

Event Timeline

@Ahoelzl this is some support work that we need to do for WE 5.5. Deadline is end of January. Who from data engineering could pick this up?

Just for reference, these access logs for https://dumps.wikimedia.org are available for analysis on stat1011.

btullis@stat1011:/srv/log/webrequest/archive/dumps.wikimedia.org$ ls -lrt|tail
-rw-r--r-- 1 root root  3324557 Jan  3 00:00 error.log-20250103.gz
-rw-r--r-- 1 root root  8162742 Jan  3 00:00 access.log-20250103.gz
-rw-r--r-- 1 root root  2824811 Jan  4 00:00 error.log-20250104.gz
-rw-r--r-- 1 root root  8985045 Jan  4 00:00 access.log-20250104.gz
-rw-r--r-- 1 root root  2816792 Jan  5 00:00 error.log-20250105.gz
-rw-r--r-- 1 root root  7637700 Jan  5 00:00 access.log-20250105.gz
-rw-r--r-- 1 root root  2848086 Jan  6 00:00 error.log-20250106.gz
-rw-r--r-- 1 root root  7467706 Jan  6 00:00 access.log-20250106.gz
-rw-r--r-- 1 root root  2383508 Jan  7 00:00 error.log-20250107.gz
-rw-r--r-- 1 root root  6774988 Jan  7 00:00 access.log-20250107.gz

I had to move them from stat1007 to a newer stat host as part of T353785

Hi @VirginiaPoundstone yes, thank you for asking. I just finished a preliminary analysis of the logs by months - months because I kept running into network issues. I will work on collating the months and analysing further. cc @Ahoelzl

Preparation work

  • Download all files comprising 6 months of data
  • Make files available for data analysis

Analysis goals

  • Counts of detailed access paths
  • Aggregation of meaningful parent path elements (whatever represents a wiki or an "other" dump, …)
  • Count request IPs (understand main requester)
  • Identify infrequently accessed entities

Aggregation over a 6-month time window should provide a good average view of access patterns. No further time-based analysis is needed.
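A minimal sketch of the kind of aggregation these goals imply, with assumed table and column names (dump_logs, uri_path, ip):

-- Count requests and distinct requester IPs per parent path element (sketch)
SELECT
    regexp_extract(uri_path, '^/([^/]+)/', 1) AS parent_path,  -- first path element: a wiki or dump family
    COUNT(*)           AS requests,
    COUNT(DISTINCT ip) AS distinct_ips
FROM dump_logs
GROUP BY regexp_extract(uri_path, '^/([^/]+)/', 1)
ORDER BY requests DESC;

Sorting ascending instead would surface the infrequently accessed entities.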

@JEbe-WMF are you also preparing a visualization dashboard (in Superset) and a findings and limitations report?

Do you need any support or a second pair of eyes on the data?

I was not aware you could have a visualisation dashboard on Superset; I am exploring this. Yes, I could also use an extra set of eyes.

I browsed through the data for the dumps logs. It's interesting, especially seeing ZH content as the top hit. Were you able to determine if there is any user information that we can capture from the logs, to determine how many unique users and downloads there are?

Also -- I see that the top paths are broken out by date. Thoughts on our ability to aggregate or otherwise normalize utilization, regardless of date? Generally curious if there is anything special that happened in August too, given 3 of the top routes are from August dumps. Is that the first time we saw ZH wiki data pulled? Similarly, where it says 'multistream', does that count as multiple connections simultaneously, meaning the count could be misleadingly high?

Hello team! Just providing an update :D
A dashboard has been created! Find the dashboard here
Still working on the script for cleaning the data, so the dashboard only contains data for 2024-08, as I am using it as the yardstick for cleaning.

Excellent!! Thank you for the update :D Would you be able to grant me access? I'm getting "Permission denied: user=hcoplin, access=EXECUTE"

I'm getting the same error message too.

We are working through assessing all dumps and their variations to provide dump / wiki specific aggregation rules.
Work document here:
https://docs.google.com/document/d/1XahpFdTIV3kf28ft6NaEOE8IHZQ_3jrNed52AiT7MUY/edit?tab=t.0

@Ahoelzl do you have a sense of how much time is required to complete this analysis?

Do we think that the access logs we have are sufficiently relevant? Given our bandwidth limitation, I suspect that most users are downloading dumps from mirrors.

We have finished the log path parsing code that will classify and augment the dump webrequest logs. See: https://docs.google.com/spreadsheets/d/1HjzXM3dn5m82KVmCYKIuaNPG1eJfsU18-p7mmucAnj0/edit?pli=1&gid=0#gid=0
Next step is to integrate this code into the existing dump webrequest parser to generate a final augmented dumps download table. I think we can have this finished by tomorrow.
Then, we'll need to aggregate the augmented dump download table into a cube-like aggregated table, ready for analysis. This should take a couple of hours more.
If all goes well, we'll have the queryable table by the end of this week.
The main risk is that we could discover that the current parsing was not detailed enough for analysis (for some dumps), and so we would need to refine it.
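For the cube-like table, one option is Hive's grouping sets, which produce several aggregation granularities in a single pass. A sketch with assumed table and dimension names:

-- Aggregate at multiple granularities at once (sketch; names assumed)
SELECT dump_name, wiki, access_date, COUNT(*) AS requests
FROM augmented_dump_downloads
GROUP BY dump_name, wiki, access_date
GROUPING SETS ((dump_name, wiki, access_date), (dump_name, wiki), (dump_name), ());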

Do we think that the access logs we have are sufficiently relevant? Given our bandwidth limitation, I suspect that most users are downloading dumps from mirrors.

That's a good question.
As you suggest, our access-log download counts are probably significantly smaller than the total count.
And we probably cannot assume that the overall proportions of download counts per dump type hold in the dumps.wikimedia.org sample.
@Ahoelzl was this a consideration at the beginning of the project? Are we fine to only use our own access logs?

@mforns @Gehel we only have access to our logs, that's been a constraint from the beginning.

Initial dashboard with aggregated counts available here:
https://superset.wikimedia.org/superset/dashboard/2f06e4cb-86a1-4a5a-ba04-b5bb1e77f71e/?native_filters_key=zod_hb22SIT5DICT6rkvwtS7ITYKPBpmCRkjn4vnOoh6205jE8cTtWWDceB44mkR

Caveats: currently all file requests are counted towards a specific dump download. An updated version will be provided shortly to adjust for that. The dashboard and results are pending a second review to ensure correctness in data processing and reporting.

Update: parser and aggregation code has been extensively tested.

UPDATED 2/26

Although the initial analysis goal has been redefined here https://phabricator.wikimedia.org/T383175#10481404 (due to lack of available information in the log data),
here is some clarity on the delivered analysis results and the initial scope of the request:

We aim to answer these key questions:
What is user preference for full dumps vs incremental dumps?

-- Heuristic: dump versions dated the 1st of the month (YYYYMM01) are counted
-- as full dumps; any other valid YYYYMMDD version counts as a partial dump.
-- The last column reproduces the full-dump share shown in the table below.
SELECT
    dump_name,
    SUM(CASE WHEN time LIKE '202%01' THEN 1 ELSE 0 END) AS full_dump,
    SUM(CASE WHEN time NOT LIKE '202%01' THEN 1 ELSE 0 END) AS partial_dump,
    ROUND(100 * SUM(CASE WHEN time LIKE '202%01' THEN 1 ELSE 0 END) / COUNT(*), 1)
        AS percentage_full_dump
FROM dump_logs_updated
WHERE time LIKE '202%' AND LENGTH(time) = 8  -- keep only well-formed YYYYMMDD values
GROUP BY dump_name;
| dump_name | full_dump | partial_dump | percentage full_dump |
| pagetitles | 15,829 | 463,787 | 3.3% |
| cirrussearch | 8,096 | 239,942 | 3.3% |
| categoriesrdf_daily | 179,479 | 5,444,454 | 3.2% |
| wikibase_commonswiki_historical | 181 | 1,399 | 11.5% |
| shorturls | 67 | 886 | 7.0% |
| content_translation | 3,501 | 87,558 | 3.8% |
| categoriesrdf_historical | 1,500 | 105,763 | 1.4% |
| wikibase_wikidatawiki_historical | 118,249 | 601,567 | 16.4% |
| mediatitles | 17,003 | 504,832 | 3.3% |
| pageview_complete | 20,233 | 605,360 | 3.2% |
| pageviews | 168,419 | 4,781,763 | 3.4% |
| database_backup | 21,534,529 | 7,146,162 | 75.1% |

What is the utilization across dump types? (e.g. XML, old HTML, Enterprise HTML, backups, analytics)

Dump types are distinguished (e.g. XML, HTML, backups, …), but we don't have information on Enterprise downloads.

Are users typically downloading the latest version, or is there a preference for older versions?

Only database_backup (SQL/XML dumps) provides "latest" information.

-- "latest" requests are identified by the literal 'latest' in the version path;
-- per the note above, only the database_backup dumps expose such a path.
SELECT
    dump_name,
    SUM(CASE WHEN time LIKE '%latest%' THEN 1 ELSE 0 END) AS latest_version,
    SUM(CASE WHEN time NOT LIKE '%latest%' THEN 1 ELSE 0 END) AS specific_version
FROM dump_logs_updated
GROUP BY dump_name;
| dump_name | latest_version | specific_version | percentage |
| database_backup | 10,184,037 | 30,334,602 | 33.57% |

What is the utilization across Wikimedia projects? (e.g. language-specific utilization; Wikipedia vs Wiktionary)

Available in chart breakdown on dashboard.

How are dumps being downloaded? (e.g. bot/automated process vs human)

We only have IP and user agent request information.
Top user agents breakdown is available on dashboard.

From a first-glance interpretation, most user agents seem to indicate bot traffic (as expected).
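For illustration, a crude user-agent heuristic of this kind can be expressed directly in SQL; the pattern list below is an assumption for the sketch, not the production classification rule:

-- Split requests into likely-automated vs possibly-human buckets (sketch)
SELECT agent_class, COUNT(*) AS requests
FROM (
    SELECT CASE
               WHEN LOWER(user_agent) RLIKE 'bot|crawl|spider|curl|wget|python-requests'
                   THEN 'probably automated'
               ELSE 'possibly human'
           END AS agent_class
    FROM dump_logs_updated
) classified
GROUP BY agent_class;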

What are the most popular mirrors to click through to?

Can we tell that from the IP information?

Additionally we would like to learn about:
User Demographics
Identify proportion of human vs bot downloaders
Distinguish between corporate and volunteer community users

Can we tell that from the IP information?

Usage Patterns
Analyze preferences for monthly full dumps vs frequent updates
Determine which dump types are most valuable to different user groups

We don't have information on user groups.

Thanks for the detailed update, @Ahoelzl ! As a follow up, I'm still unable to access the dashboard due to permissions issues. Could you help me resolve that so I can review in more detail?

@Ottomata -- are you able to grant me the LDAP access? I created the phabricator ticket per the instructions, but I'm not sure how long approvals usually take https://phabricator.wikimedia.org/T387459

I probably can but it will take me ages to remember how to do it. I just bumped DPE-SRE, they are usually pretty snappy.

BTW, you won't need explicit approval since you are a WMF employee; someone just has to do it :)

Howdy! Following up on the asks we had during the sync this morning. I captured the main requests below; let me know if you have additional questions about these asks.

[Top priority -- If possible, would be great to have these by next week to support data exploration at leadership meeting]

  • Dedupe traffic for a more accurate view of the most commonly used dumps
  • Allow filtering/searching by user-agent --> Applies to all other graphs/views so they are displayed per user agent

[Next priority]

  • See if we can use heuristics to categorize user agents into "probably human" vs "probably automated"
  • Add additional filtering/slices:
    • Date range selection
    • Dump type
    • Wiki category (e.g. Wiktionary vs Wikipedia)
    • Language

[Low priority/nice to have]

  • IP insights:
    • Location of data requests
    • IP mapping against known actors/bots
  • Data volume associated with requests

@HCoplin-WMF did you get the LDAP access?

@Ahoelzl when do you expect these additional views of the data to be added to the dashboard?

Ahoelzl updated Other Assignee, added: Ahoelzl.
Ahoelzl added a subscriber: JEbe-WMF.

Hey all!
Here's a second version of the dashboard with:

  • Deduplicated multipart range requests (206; see the sketch after the dashboard link)
  • UserAgent filter
  • Self-identified bot classification
  • Additional filters (date range, granularity, dump type, wiki project and language)
  • Country of the requests
  • Downloaded bytes
  • ISP info

https://superset.wikimedia.org/superset/dashboard/579/
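For reference, a minimal sketch of the range-request deduplication idea; the column names (request_ts, http_status) are assumptions, and the actual implementation may differ:

-- Collapse the many HTTP 206 partial responses one client generates while
-- fetching a single file into one logical download (sketch)
SELECT
    dump_name,
    COUNT(DISTINCT ip, user_agent, uri_path, to_date(request_ts)) AS downloads
FROM dump_logs_updated
WHERE http_status IN (200, 206)
GROUP BY dump_name;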

Fabulous!! Thank you for the quick turnaround. I also got access approved yesterday, so I will take a deeper look this afternoon.

First impressions are wonderful!

I do have a couple of immediate follow up questions, if you wouldn't mind clarifying:

  1. Could you explain what the "Correlation between download time and data interval" chart means? I'm not exactly sure how to interpret it.
  2. How is the self-identified bots classification determined?

Great work @mforns !
Happy you have access now @HCoplin-WMF .

Regarding your follow up questions:

  1. the chart helps show which dump version (dump time) was requested at each download time. Most requests are for the current / latest dump version, but there is a tail of requests for older versions.
  2. based on user-agent string analysis, leveraging org.wikimedia.analytics.refinery.hive.IsSpiderUDF (see the refinery source)
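For reference, the UDF can be applied along these lines (a sketch; it assumes the refinery-hive jar is already available in the Hive session):

-- Register the refinery UDF and classify requests by user agent (sketch)
CREATE TEMPORARY FUNCTION is_spider AS 'org.wikimedia.analytics.refinery.hive.IsSpiderUDF';

SELECT is_spider(user_agent) AS self_identified_bot, COUNT(*) AS requests
FROM dump_logs_updated
GROUP BY is_spider(user_agent);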

General note: deduplicating multi-file dumps and filtering on 20x paints a very different picture, with database_dumps now on top, as expected.

Great! Thank you for clarifying. The dashboard is great!

The only other minor ask I might have is if we could add a multi-select "starts with" style filter for the IP ranges? It'll make it easier to look at internal & toolforge traffic (or strip them out instead, if it's easy to do a "contains"/"does not contain" option). Technically you can filter by individual IPs in the list now, but it would be nice to be able to look into ranges a little more easily. Don't worry about it if that's a huge lift though! Although the queries to get that seem pretty straightforward, I legitimately have no idea how much effort adding that is in superset, so I can make it work as-is if that's a lot of effort.
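For illustration, the kind of "starts with" filter meant here, with hypothetical ranges and an assumed ip column:

-- Prefix filtering on IP ranges, e.g. internal or Toolforge traffic (sketch)
SELECT *
FROM dump_logs_updated
WHERE ip LIKE '185.15.%'   -- hypothetical range of interest
   OR ip LIKE '172.16.%';  -- hypothetical internal range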

@HCoplin-WMF I added a search box ("contains") for IP filtering; prefix search does not seem to be available out of the box.