
JAllemandou (joal)
Data Engineer

User Details

User Since
Feb 11 2015, 6:02 PM (331 w, 3 d)
Availability
Available
IRC Nick
joal
LDAP User
Unknown
MediaWiki User
JAllemandou (WMF)

Recent Activity

Fri, Jun 11

JAllemandou added a comment to T284623: Top edited pages list on enwiktionary contains nonexistent pages with titles made up of question marks.

I ran a query and found that the problem comes from some page titles not being decoded correctly and showing up as '?'. Since the top metrics don't differentiate by page-id, multiple pages are bundled together because they share the same wrong title:

// Top 10 most-edited content page titles on enwiktionary since 2021-05-01
spark.sql("""
 SELECT
   page_title,
   COUNT(1) AS c
 FROM wmf.mediawiki_history
 WHERE snapshot = '2021-05'
   AND wiki_db = 'enwiktionary'
   AND date(event_timestamp) >= '2021-05-01'
   AND event_entity = 'revision'
   AND page_namespace_is_content
 GROUP BY page_title
 ORDER BY c DESC
 LIMIT 10
""").show(10, false)
Fri, Jun 11, 7:55 AM · Analytics-Kanban, Analytics, Analytics-Wikistats
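The bundling effect described in the T284623 comment can be sketched in a few lines of Python. This is only an illustration under assumptions: the page ids, the titles and the helper names `top_by_title` / `top_by_page` are made up, and '??' stands in for a title that was not decoded correctly.

```python
from collections import Counter

# Hypothetical revision events as (page_id, page_title) pairs.
# '??' stands in for titles that were not decoded correctly.
revisions = [
    (101, "??"),
    (102, "??"),
    (103, "??"),
    (104, "water"),
    (104, "water"),
]

def top_by_title(events):
    """Count edits per title only, the way a title-keyed top list would."""
    return Counter(title for _, title in events).most_common()

def top_by_page(events):
    """Count edits per (page_id, title), keeping distinct pages apart."""
    return Counter(events).most_common()

by_title = top_by_title(revisions)
by_page = top_by_page(revisions)

print(by_title)  # three different pages collapse into the single '??' entry
print(by_page)   # each page keeps its own count
```

Keying on the title alone puts the '??' entry first, while keying on the page id shows that no single mis-decoded page has more than one edit.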

Tue, Jun 8

JAllemandou moved T284537: Move WikimediaEventUtilities logging to Slf4j from Next Up to In Code Review on the Analytics-Kanban board.
Tue, Jun 8, 4:05 PM · Patch-For-Review, Analytics, Analytics-Kanban
JAllemandou claimed T284537: Move WikimediaEventUtilities logging to Slf4j.
Tue, Jun 8, 7:42 AM · Patch-For-Review, Analytics, Analytics-Kanban
JAllemandou created T284537: Move WikimediaEventUtilities logging to Slf4j.
Tue, Jun 8, 7:41 AM · Patch-For-Review, Analytics, Analytics-Kanban

Mon, Jun 7

JAllemandou updated subscribers of T283084: Missing hourly partition for event.mediawiki_revision_recommandation_create.

Heya @Ottomata - Could you please provide a status summary on this (asked by @Gehel on IRC) - thanks :)

Mon, Jun 7, 12:02 PM · Analytics-Kanban, Patch-For-Review, Discovery-Search (Current work), Analytics-Clusters

Tue, Jun 1

JAllemandou closed T283536: Request to delete test_gsc_* datasets from Druid (& Superset/Turnilo) as Resolved.
Tue, Jun 1, 4:07 PM · Analytics-Kanban, Product-Analytics, Analytics

Mon, May 31

JAllemandou added a comment to T283536: Request to delete test_gsc_* datasets from Druid (& Superset/Turnilo).

Done

Mon, May 31, 11:47 AM · Analytics-Kanban, Product-Analytics, Analytics
JAllemandou moved T283536: Request to delete test_gsc_* datasets from Druid (& Superset/Turnilo) from Next Up to Done on the Analytics-Kanban board.
Mon, May 31, 11:46 AM · Analytics-Kanban, Product-Analytics, Analytics

Wed, May 26

JAllemandou added a comment to T221890: Add wikidata ids to data lake tables.

Actually this table is now production-style on the cluster, at path hdfs:///wmf/data/wmf/wikidata/item_page_link, or hive table wmf.wikidata_item_page_link.
It is released weekly and uses events for page creations/deletions/moves to be as precise as possible (we have monthly snapshots of the page table, and get the current month's info from events).

Wed, May 26, 6:04 PM · Epic, Analytics, Product-Analytics

Tue, May 25

JAllemandou added a comment to T283256: Extract operator/nodes/triples/paths/exprs list from queries.

The problem I see with using a generic class in the QueryElem object is the conversion to parquet. I don't think it will work out of the box, which means we would have to devise our own conversion. Let's brainstorm ideas on this, possibly in a meeting to make it faster :)

Tue, May 25, 8:30 AM · Wikidata, Wikidata-Query-Service

May 21 2021

dcausse awarded T282129: Test triple-analysis functions over a large dataset with Spark a Love token.
May 21 2021, 7:19 AM · Wikidata, Wikidata-Query-Service

May 20 2021

JAllemandou edited projects for T283261: Define priorities for HDFS data to be backed up, added: Analytics; removed Analytics-Clusters, Data-Persistence-Backup.
May 20 2021, 5:20 PM · Analytics
JAllemandou added a comment to T283261: Define priorities for HDFS data to be backed up.

Ack - doing so - thanks @elukey

May 20 2021, 5:20 PM · Analytics
JAllemandou awarded T283254: Wikistats should allow more than one project a Love token.
May 20 2021, 5:18 PM · Analytics-Kanban, Analytics
JAllemandou added a comment to T283261: Define priorities for HDFS data to be backed up.

@LSobanski You're absolutely right, this task is about documenting on our end the priorities and sizes of datasets to be backed up so that we can better inform next steps (including potential implementation) later.

May 20 2021, 5:14 PM · Analytics
JAllemandou created T283261: Define priorities for HDFS data to be backed up.
May 20 2021, 5:08 PM · Analytics
JAllemandou created T283258: Provide a job regularly deleting wdqs processed query after 90 days.
May 20 2021, 4:28 PM · Discovery-Search (Current work), Patch-For-Review, Wikidata, Wikidata-Query-Service
JAllemandou updated the task description for T273854: Automate regular WDQS query parsing and data-extraction.
May 20 2021, 4:25 PM · Discovery-Search (Current work), Patch-For-Review, Wikidata-Query-Service, Wikidata, Analytics
JAllemandou added a subtask for T280640: Refine WDQS queries analysis: T273854: Automate regular WDQS query parsing and data-extraction.
May 20 2021, 4:24 PM · Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
JAllemandou added a parent task for T273854: Automate regular WDQS query parsing and data-extraction: T280640: Refine WDQS queries analysis.
May 20 2021, 4:24 PM · Discovery-Search (Current work), Patch-For-Review, Wikidata-Query-Service, Wikidata, Analytics
JAllemandou placed T273854: Automate regular WDQS query parsing and data-extraction up for grabs.
May 20 2021, 4:24 PM · Discovery-Search (Current work), Patch-For-Review, Wikidata-Query-Service, Wikidata, Analytics
JAllemandou created T283256: Extract operator/nodes/triples/paths/exprs list from queries.
May 20 2021, 4:21 PM · Wikidata, Wikidata-Query-Service
JAllemandou created T283255: Create CLI job extracting info from wdqs queries.
May 20 2021, 4:18 PM · Wikidata, Wikidata-Query-Service
JAllemandou closed T282129: Test triple-analysis functions over a large dataset with Spark, a subtask of T280640: Refine WDQS queries analysis, as Resolved.
May 20 2021, 12:26 PM · Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
JAllemandou closed T282129: Test triple-analysis functions over a large dataset with Spark as Resolved.
May 20 2021, 12:26 PM · Wikidata, Wikidata-Query-Service
JAllemandou added a comment to T282129: Test triple-analysis functions over a large dataset with Spark.

Closing this task :) Thanks for the great work @AKhatun_WMF

May 20 2021, 12:26 PM · Wikidata, Wikidata-Query-Service
JAllemandou closed T282130: Provide a way to save extracted query-information in parquet format as Resolved.

Great! Thanks for that :) Closing the ticket.

May 20 2021, 11:59 AM · Wikidata, Wikidata-Query-Service
JAllemandou closed T282130: Provide a way to save extracted query-information in parquet format, a subtask of T280640: Refine WDQS queries analysis, as Resolved.
May 20 2021, 11:59 AM · Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
JAllemandou added a comment to T282130: Provide a way to save extracted query-information in parquet format.

@AKhatun_WMF That's great! Could you please provide some info on the expected data size in parquet (for daily data for instance)? Many thanks.

May 20 2021, 7:11 AM · Wikidata, Wikidata-Query-Service

May 19 2021

JAllemandou closed T279380: Add Traffic's notion of "from public cloud" to Analytics webrequest data as Resolved.

The new field is in turnilo with data starting from May 18th 2021.
https://w.wiki/3MJq
Resolving the task :)

May 19 2021, 4:37 PM · Analytics-Kanban, SRE, Analytics, Traffic
JAllemandou updated the task description for T279380: Add Traffic's notion of "from public cloud" to Analytics webrequest data.
May 19 2021, 4:35 PM · Analytics-Kanban, SRE, Analytics, Traffic
JAllemandou moved T279380: Add Traffic's notion of "from public cloud" to Analytics webrequest data from Ready to Deploy to Done on the Analytics-Kanban board.
May 19 2021, 4:35 PM · Analytics-Kanban, SRE, Analytics, Traffic
JAllemandou closed T282178: Article missing from the Clickstream dataset as Resolved.

Resolving for now - please reopen if needed :)

May 19 2021, 4:08 PM · Analytics-Kanban, Analytics

May 18 2021

JAllemandou updated subscribers of T283084: Missing hourly partition for event.mediawiki_revision_recommandation_create.

Heya @EBernhardson, not having canary events in the refined data is expected: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/TransformFunctions.scala#L50
I nonetheless think that partitions should be added even if the dataset is empty - ping @Ottomata on this.

May 18 2021, 6:44 PM · Analytics-Kanban, Patch-For-Review, Discovery-Search (Current work), Analytics-Clusters
JAllemandou added a comment to T282178: Article missing from the Clickstream dataset.

The clickstream algorithm resolves one step of redirects: if page A redirects to page B, views of page A are counted for page B. Multi-step redirects are resolved by one step only. For instance, with A -> B -> C, a visit to A counts as a view of page B.
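A minimal Python sketch of that one-step behaviour, assuming a plain title-to-target mapping. The redirect table, the visit list and the function names are hypothetical, and this is not the actual clickstream job code.

```python
from collections import Counter

# Hypothetical redirect table: A redirects to B, and B redirects to C.
redirects = {"A": "B", "B": "C"}

def resolve_one_step(title, redirects):
    """Follow at most one redirect hop; non-redirect titles are unchanged."""
    return redirects.get(title, title)

def count_views(visits, redirects):
    """Attribute each visit to its one-step-resolved title."""
    return Counter(resolve_one_step(title, redirects) for title in visits)

views = count_views(["A", "A", "B", "C"], redirects)
print(views)  # visits to A land on B (not C); the visit to B lands on C
```

Because only one hop is followed, a page at the end of a chain (C here) does not receive the views of pages two or more hops away.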

May 18 2021, 12:47 PM · Analytics-Kanban, Analytics
JAllemandou moved T282178: Article missing from the Clickstream dataset from Next Up to Done on the Analytics-Kanban board.
May 18 2021, 10:04 AM · Analytics-Kanban, Analytics
JAllemandou added a comment to T282178: Article missing from the Clickstream dataset.

@diego: page-titles in clickstream use _ as separator, not space!
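As a small illustration, a lookup title can be converted before searching the dataset. The helper name is hypothetical and it only performs the separator swap mentioned above; any other title normalization is out of scope for this sketch.

```python
def to_clickstream_title(title):
    """Replace spaces with underscores, as clickstream page titles use '_'."""
    return title.strip().replace(" ", "_")

print(to_clickstream_title("Skathi (moon)"))
```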

May 18 2021, 10:03 AM · Analytics-Kanban, Analytics
JAllemandou added a comment to T280649: Update refinery-cassandra dependencies to have support for Cassandra 3.

@hnowlan : Logs at DEBUG level (prepare your eyes, those are verbose!)

May 18 2021, 8:31 AM · Patch-For-Review, Analytics-Kanban

May 17 2021

JAllemandou added a comment to T282618: Superset query timeouts for charts using Druid table.

And the feature request: https://github.com/apache/druid/issues/11264

May 17 2021, 4:18 PM · Analytics
JAllemandou added a comment to T282618: Superset query timeouts for charts using Druid table.

Also: One way to get results is to set the time-grain to the value: original value. This makes calcite use the topN query (single field in group-by instead of two). You'll get daily values instead of monthly but at least you'll have values :)

May 17 2021, 3:52 PM · Analytics
JAllemandou added a comment to T280649: Update refinery-cassandra dependencies to have support for Cassandra 3.

@hnowlan : Here is a way to access the failure logs from today's job (when the host was down):
from an-launcher1002:

sudo -u analytics kerberos-run-command analytics yarn logs --applicationId application_1620304990193_40662 | less
May 17 2021, 3:10 PM · Patch-For-Review, Analytics-Kanban
JAllemandou moved T280649: Update refinery-cassandra dependencies to have support for Cassandra 3 from In Code Review to In Progress on the Analytics-Kanban board.
May 17 2021, 12:10 PM · Patch-For-Review, Analytics-Kanban
JAllemandou claimed T279380: Add Traffic's notion of "from public cloud" to Analytics webrequest data.
May 17 2021, 12:10 PM · Analytics-Kanban, SRE, Analytics, Traffic
JAllemandou moved T279380: Add Traffic's notion of "from public cloud" to Analytics webrequest data from Next Up to In Code Review on the Analytics-Kanban board.
May 17 2021, 12:10 PM · Analytics-Kanban, SRE, Analytics, Traffic
JAllemandou added a comment to T279380: Add Traffic's notion of "from public cloud" to Analytics webrequest data.

@CDanis the patch for Druid is there - sorry for not having acted quicker.

May 17 2021, 12:09 PM · Analytics-Kanban, SRE, Analytics, Traffic
JAllemandou added a comment to T282618: Superset query timeouts for charts using Druid table.

TL;DR: This problem comes from how queries are translated from SQL to druid-query-plan. I don't have a solution for this :(

May 17 2021, 11:50 AM · Analytics
JAllemandou added a comment to T282632: Superset Presto LIMIT >10000 error.

Hi @SNowick_WMF, I double checked the number of expected rows and got 11161, not 80633 as you mentioned.

May 17 2021, 10:52 AM · Analytics-Kanban, Analytics

May 12 2021

JAllemandou added a comment to T278423: Upgrade the Hadoop masters to Debian Buster.

The plan looks great @razzi , and the comments as well!
My nits on some small things.

May 12 2021, 3:57 PM · Patch-For-Review, Analytics-Kanban, Analytics-Clusters
JAllemandou added a comment to T282657: Adding data from centralauth to the lake and the mediawiki_history dataset.

Hi @Pablo - Do you know in which DB the data is stored? If it is in the centralauth one, we don't have it. This task should then become adding data from centralauth to the lake and the mediawiki_history dataset.

May 12 2021, 10:07 AM · Research, Analytics

May 11 2021

JAllemandou added a comment to T280107: Generate dump of scored-revisions from 2018-2020 for Wikis except English Wikipedia.

Hi - I am trying to make this happen.
Data for the wikidata project is very big (many edits, and the itemquality model to be added to the other ones). Do you need it, or can I leave this project out of the export (the export would then cover all models for all edits of all projects except enwiki and wikidatawiki)?
Thanks

May 11 2021, 3:28 PM · Analytics-Kanban, artificial-intelligence, editquality-modeling, ORES, Machine-Learning-Team, Analytics

May 6 2021

JAllemandou created T282139: Provide a quantitative description of the Wikidata-triples dataset.
May 6 2021, 1:48 PM · Wikidata, Wikidata-Query-Service
JAllemandou added a subtask for T280640: Refine WDQS queries analysis: T282130: Provide a way to save extracted query-information in parquet format.
May 6 2021, 1:34 PM · Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
JAllemandou added a parent task for T282130: Provide a way to save extracted query-information in parquet format: T280640: Refine WDQS queries analysis.
May 6 2021, 1:34 PM · Wikidata, Wikidata-Query-Service
JAllemandou created T282130: Provide a way to save extracted query-information in parquet format.
May 6 2021, 1:34 PM · Wikidata, Wikidata-Query-Service
JAllemandou added a subtask for T280640: Refine WDQS queries analysis: T282129: Test triple-analysis functions over a large dataset with Spark.
May 6 2021, 1:31 PM · Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
JAllemandou added a parent task for T282129: Test triple-analysis functions over a large dataset with Spark: T280640: Refine WDQS queries analysis.
May 6 2021, 1:31 PM · Wikidata, Wikidata-Query-Service
JAllemandou created T282129: Test triple-analysis functions over a large dataset with Spark.
May 6 2021, 1:31 PM · Wikidata, Wikidata-Query-Service
JAllemandou added a subtask for T280640: Refine WDQS queries analysis: T282127: Add unit-tests to WDQS analysis toolkit.
May 6 2021, 1:28 PM · Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
JAllemandou added a parent task for T282127: Add unit-tests to WDQS analysis toolkit: T280640: Refine WDQS queries analysis.
May 6 2021, 1:28 PM · Wikidata, Wikidata-Query-Service
JAllemandou created T282127: Add unit-tests to WDQS analysis toolkit.
May 6 2021, 1:27 PM · Wikidata, Wikidata-Query-Service

May 4 2021

JAllemandou added a comment to T280011: Top read repeats.

Thanks @kzimmerman for the heads up :)
On our side, we are not forgetting about improving the heuristics.

May 4 2021, 6:11 PM · Product-Analytics, Analytics
JAllemandou created T281808: Wikidata all-json dumps not available from 2021-04-26.
May 4 2021, 9:11 AM · wdwb-tech, Analytics, Dumps-Generation, Wikidata

May 3 2021

JAllemandou claimed T281668: Fix sqoop script to use timestamp limits in `--boundary-query` queries.
May 3 2021, 3:08 PM · Analytics-Kanban, Analytics
JAllemandou moved T281668: Fix sqoop script to use timestamp limits in `--boundary-query` queries from Next Up to Done on the Analytics-Kanban board.
May 3 2021, 10:24 AM · Analytics-Kanban, Analytics
JAllemandou added a project to T281668: Fix sqoop script to use timestamp limits in `--boundary-query` queries: Analytics-Kanban.
May 3 2021, 10:24 AM · Analytics-Kanban, Analytics
JAllemandou created T281668: Fix sqoop script to use timestamp limits in `--boundary-query` queries.
May 3 2021, 7:32 AM · Analytics-Kanban, Analytics

Apr 29 2021

JAllemandou added a comment to T280565: Improve pageview automated traffic detection heuristics.

@JAllemandou There are a couple of other tickets (T270784, T274823) that might be resolved if the automated traffic detection heuristics are improved; should I add them as subtasks?

Apr 29 2021, 4:58 PM · Analytics
JAllemandou added a comment to T280844: Too many views to Skathi (moon) on enwiki.

@kzimmerman : I added the task as a subtask of T280565.
I did some further analysis:

  • Constant distinct IPs and user-agents hourly over a day (~180 IPs, ~450 user agents, fewer during the low hours of the circadian pattern)
  • Despite being categorized as 'desktop' and 'mobile-web', all the views are from mobile-web, with Android being a good citizen and sending detailed user-agent info, and iOS not so much, doing its requests through Pandas-VPN on the desktop site with a non-detailed user-agent.
  • I looked up some IPs from the set, and they are from different cloud/dedicated server providers.
Apr 29 2021, 8:00 AM · Analytics, Product-Analytics, Pageviews-Anomaly
JAllemandou added a parent task for T280844: Too many views to Skathi (moon) on enwiki: T280565: Improve pageview automated traffic detection heuristics.
Apr 29 2021, 7:54 AM · Analytics, Product-Analytics, Pageviews-Anomaly
JAllemandou added a subtask for T280565: Improve pageview automated traffic detection heuristics: T280844: Too many views to Skathi (moon) on enwiki.
Apr 29 2021, 7:54 AM · Analytics
JAllemandou added a comment to T279567: Review request: New datasets for WMCZ published under analytics.wikimedia.org.

You say I will need to "start reworking some of your script to Airflow" – are there any help materials about what needs to be done?

Apr 29 2021, 6:51 AM · WMCZ-Stats, Analytics

Apr 27 2021

JAllemandou added a comment to T279567: Review request: New datasets for WMCZ published under analytics.wikimedia.org.

Hi @Urbanecm - Sorry for the late reply, I wanted to discuss with the team, and it happened yesterday.

Apr 27 2021, 5:59 PM · WMCZ-Stats, Analytics

Apr 21 2021

JAllemandou moved T280649: Update refinery-cassandra dependencies to have support for Cassandra 3 from In Progress to In Code Review on the Analytics-Kanban board.
Apr 21 2021, 2:59 PM · Patch-For-Review, Analytics-Kanban
JAllemandou moved T280649: Update refinery-cassandra dependencies to have support for Cassandra 3 from Next Up to In Progress on the Analytics-Kanban board.
Apr 21 2021, 8:16 AM · Patch-For-Review, Analytics-Kanban
JAllemandou moved T271232: Replace Camus by Gobblin from In Progress to Paused on the Analytics-Kanban board.
Apr 21 2021, 8:16 AM · Analytics-Kanban, Analytics
JAllemandou moved T278551: Duplicate wikitext entries for a bunch of wikis in 2021-02 snapshot from Paused to Done on the Analytics-Kanban board.
Apr 21 2021, 8:16 AM · Analytics-Kanban, Analytics

Apr 20 2021

JAllemandou created T280640: Refine WDQS queries analysis.
Apr 20 2021, 9:39 AM · Discovery-Search (Current work), Wikidata, Wikidata-Query-Service

Apr 19 2021

JAllemandou created T280565: Improve pageview automated traffic detection heuristics.
Apr 19 2021, 4:20 PM · Analytics
JAllemandou added a comment to T279567: Review request: New datasets for WMCZ published under analytics.wikimedia.org.

Follow up questions after having talked to the team:

  • How frequently does the job need to run and release new data?
  • If it is not a one-off, would we have a process for us to review again if your scripts change?
Apr 19 2021, 4:07 PM · WMCZ-Stats, Analytics
JAllemandou awarded T280549: Consolidate labs / production sqoop lists to a single list a Hungry Hippo token.
Apr 19 2021, 3:35 PM · Analytics-Kanban, Analytics
JAllemandou added a comment to T279567: Review request: New datasets for WMCZ published under analytics.wikimedia.org.

Hi @Urbanecm , thank you for pinging us on this :)
The usual pattern for data publication is to ask for an approval through a security review.
I have quickly checked your code and it seems that the data you use to generate your dataset is already public - In that case the review might not even be needed.
I'll confirm with the team the procedure to follow, and will also ask for minor changes in the code (for instance accessing the hive user table instead of the production-replica one, to avoid leaking potential PII).

Apr 19 2021, 2:54 PM · WMCZ-Stats, Analytics
JAllemandou updated subscribers of T280011: Top read repeats.

@kzimmerman Hi - Is this task something your team could look at? I have triple checked and confirm that at least a few of the listed pages show unnatural patterns:

Apr 19 2021, 2:31 PM · Product-Analytics, Analytics
JAllemandou added a comment to T280168: Hive: create table statement failure.

I managed to have this working in my personal database. Can we sync on this via IRC @nettrom_WMF ?

Apr 19 2021, 2:20 PM · Analytics, Product-Analytics
JAllemandou added a comment to T273310: Reduce partition granularity of hive tables.

Something to note: Hive separates table metadata from storage. When using external tables in Hive, dropping the tables only deletes the metadata, not the data itself:

hdfs dfs -du -s -h /user/hive/warehouse/bd808.db/*
Apr 19 2021, 2:01 PM · Analytics-Radar, User-bd808
JAllemandou updated subscribers of T94019: Generate RDF from JSON.

Info: There is already a job on the cluster doing TTL -> RDF conversion. The TTL dumps are imported weekly, and converted to blazegraph RDF once available.
The job is maintained by the Search Platform team (ping @dcausse :).

Apr 19 2021, 10:16 AM · wdwb-tech, Patch-For-Review, Wikidata

Apr 8 2021

JAllemandou closed T278815: Produce a list of wiki projects ranked by number of eligible voters in Board elections as Resolved.
Apr 8 2021, 11:35 AM · Analytics
JAllemandou updated subscribers of T278815: Produce a list of wiki projects ranked by number of eligible voters in Board elections.

Ping @kzimmerman on the above comment - Let's synchronize on who does what :)

Apr 8 2021, 11:35 AM · Analytics

Apr 7 2021

JAllemandou closed T279095: Sqoop on multi-instance clouddb1021 is very slow for some tables as Declined.

Thanks for your suggestion @Marostegui.
The global drift is not big (this month took 4h more than the previous one, less than 10% increase overall).
As discussed with @Milimetric there would be multiple options to try to make the overall process faster, but we are not going to prioritize this for now.
Let's close and reopen if needed.
Many thanks.

Apr 7 2021, 3:24 PM · Cloud-Services, Data-Persistence (Consultation), Analytics
JAllemandou added a comment to T279055: Filename convention is not easy to follow for dumps using a `precombine` step.

I have implemented some more logic to get the files we need, so no real need to change here.
This task was more about things to keep in mind if for instance filenames change at some point :)
Feel free to close it if it's not useful. Thank you for your explanations :)

Apr 7 2021, 12:27 PM · Dumps-Generation, Analytics-Radar
JAllemandou added a comment to T279055: Filename convention is not easy to follow for dumps using a `precombine` step.

Thank you for the explanation @ArielGlenn.
Let me clarify my 2 concerns (they are minor):

  • job names are different for the same output in dumpstatus.json: for small wikis you should look at xmlstubsdump while for big ones you should look at xmlstubsdumprecombine (this makes it hard to monitor all projects).
  • filenames share the same pattern between different jobs, making it confusing to get data across multiple projects with a single job. For pages-meta-current, you should get PROJECT-DATE-pages-meta-current.xml.bz2 even if PROJECT-DATE-pages-meta-current*.xml*.bz2 exist, since small projects won't have the split files and you want all projects to match. For pages-meta-history you should get PROJECT-DATE-pages-meta-history*.xml*.bz2 as there is supposedly never both single files and split-by-pages files.
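The selection rule in the second point can be sketched in Python with the standard `fnmatch` module. The function name and the sample file names below are hypothetical (made up to follow the patterns quoted above), so treat this as an illustration and not as the logic actually implemented.

```python
from fnmatch import fnmatch

def select_dump_files(filenames, project, date, job):
    """Pick the files to read for one project, date and dump job."""
    prefix = f"{project}-{date}-{job}"
    if job == "pages-meta-current":
        # Take only the single combined file, so that small projects
        # (no split files) and big projects are handled the same way.
        return [f for f in filenames if f == f"{prefix}.xml.bz2"]
    if job == "pages-meta-history":
        # Single files and split-by-pages files are assumed never to
        # coexist, so one glob pattern covers both cases.
        return [f for f in filenames if fnmatch(f, f"{prefix}*.xml*.bz2")]
    raise ValueError(f"unsupported job: {job}")

# Hypothetical directory listing for one project.
listing = [
    "enwiki-20210401-pages-meta-current.xml.bz2",
    "enwiki-20210401-pages-meta-current1.xml-p1p41242.bz2",
    "enwiki-20210401-pages-meta-history1.xml-p1p1000.bz2",
    "enwiki-20210401-pages-meta-history2.xml-p1001p2000.bz2",
]

current = select_dump_files(listing, "enwiki", "20210401", "pages-meta-current")
history = select_dump_files(listing, "enwiki", "20210401", "pages-meta-history")
print(current)
print(history)
```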
Apr 7 2021, 8:14 AM · Dumps-Generation, Analytics-Radar
JAllemandou awarded T279380: Add Traffic's notion of "from public cloud" to Analytics webrequest data a Baby Tequila token.
Apr 7 2021, 7:58 AM · Analytics-Kanban, SRE, Analytics, Traffic
JAllemandou updated the task description for T279380: Add Traffic's notion of "from public cloud" to Analytics webrequest data.
Apr 7 2021, 7:58 AM · Analytics-Kanban, SRE, Analytics, Traffic
JAllemandou added a comment to T279380: Add Traffic's notion of "from public cloud" to Analytics webrequest data.

+1 on the approach (updating the task description for details)

Apr 7 2021, 7:57 AM · Analytics-Kanban, SRE, Analytics, Traffic

Apr 6 2021

JAllemandou added a comment to T278815: Produce a list of wiki projects ranked by number of eligible voters in Board elections.

There you go @Qgil :)

Apr 6 2021, 5:21 PM · Analytics

Apr 1 2021

JAllemandou created T279095: Sqoop on multi-instance clouddb1021 is very slow for some tables.
Apr 1 2021, 7:13 PM · Cloud-Services, Data-Persistence (Consultation), Analytics
JAllemandou updated subscribers of T277062: Review the Yarn Capacity scheduler and see if we can move to it.

We could do:

  • fifo - 5%
  • default - 35%
  • production - 50%
  • essential - 10%
Apr 1 2021, 6:04 PM · Analytics-Kanban, Patch-For-Review, Analytics-Clusters
JAllemandou moved T269211: Convert labsdb1012 from multi-source to multi-instance from Next Up to Ready to Deploy on the Analytics-Kanban board.
Apr 1 2021, 5:26 PM · Analytics-Kanban, cloud-services-team (Kanban), Data-Services, DBA, Patch-For-Review, Analytics-Clusters
JAllemandou moved T278551: Duplicate wikitext entries for a bunch of wikis in 2021-02 snapshot from Next Up to Paused on the Analytics-Kanban board.
Apr 1 2021, 5:26 PM · Analytics-Kanban, Analytics
JAllemandou added a project to T278551: Duplicate wikitext entries for a bunch of wikis in 2021-02 snapshot: Analytics-Kanban.
Apr 1 2021, 5:26 PM · Analytics-Kanban, Analytics
JAllemandou added a comment to T278551: Duplicate wikitext entries for a bunch of wikis in 2021-02 snapshot.

I confirm the data is fixed for snapshot=2021-02 - Let's keep this open as a reminder to monitor the next snapshot.

Apr 1 2021, 5:25 PM · Analytics-Kanban, Analytics