
Investigate CPU usage on an-launcher1002
Closed, Resolved, Public

Description

an-launcher1002 hosts Airflow, Sqoop, and some other timer jobs. Its CPU usage seems much higher than we would expect:

https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=an-launcher1002&var-datasource=thanos&var-cluster=analytics

Event Timeline

I believe that the system load on an-launcher1002 is much lower than it was when this ticket was created.
Therefore, I would argue that we should perhaps close the ticket.

However, I just checked the graphs linked above and correlated them with the htop output.

image.png (351×927 px, 44 KB)

One job was particularly heavy:

image.png (81×1 px, 27 KB)

When I searched for the comment shown there, I found that it appears in two reportupdater jobs.

btullis@an-launcher1002:/srv/reportupdater/jobs$ grep -RI Extension:Cite
reportupdater-queries/reference-previews/cite.hql:-- Estimate the relative frequency of user interactions with Extension:Cite
reportupdater-queries/reference-previews/baseline.hql:-- Estimate the relative frequency of user interactions with Extension:Cite

So it would appear that the reportupdater-reference-previews service is at least one of the heavy components on an-launcher1002.
This service also uses the hive CLI to do its work, which we wish to deprecate with the bullseye upgrade.

Finally, I checked the logs for this service and discovered that there is an error from one part of the report.

btullis@an-launcher1002:/lib/systemd/system$ journalctl -u reportupdater-reference-previews.service -f
-- Logs begin at Tue 2023-03-28 22:08:36 UTC. --
Mar 29 11:34:12 an-launcher1002 kerberos-run-command[5517]: FAILED: ParseException line 42:0 cannot recognize input near 'event_counts' ';' '<EOF>' in joinSource
Mar 29 11:34:12 an-launcher1002 kerberos-run-command[5517]: 2023-03-29 07:36:26,986 - ERROR - Report "baseline" could not be executed because of error: object of type 'NoneType' has no len()
Mar 29 11:34:12 an-launcher1002 kerberos-run-command[5517]: Traceback (most recent call last):
Mar 29 11:34:12 an-launcher1002 kerberos-run-command[5517]:   File "/srv/reportupdater/reportupdater/reportupdater/executor.py", line 134, in execute_hive
Mar 29 11:34:12 an-launcher1002 kerberos-run-command[5517]:     report.results = self.normalize_results(report, None, tsv_reader)
Mar 29 11:34:12 an-launcher1002 kerberos-run-command[5517]:   File "/srv/reportupdater/reportupdater/reportupdater/executor.py", line 198, in normalize_results
Mar 29 11:34:12 an-launcher1002 kerberos-run-command[5517]:     empty_row = [report.start] + [None] * (len(normalized_header) - 1)
Mar 29 11:34:12 an-launcher1002 kerberos-run-command[5517]: TypeError: object of type 'NoneType' has no len()
Mar 29 11:34:12 an-launcher1002 kerberos-run-command[5517]: 2023-03-29 07:36:26,987 - INFO - Executing "<Report key=baseline type=hive granularity=days lag=10800 first_date=2019-11-02 start=2022-03-30 end=2022-03-31 db_key=None hql_template=-- -- Estimate the relative frequency of user interactions with Extension:Cite -- references: -- * H... sql_template=None script=None explode_by={} max_data_points=None graphite={'path': '{_metric}.{wiki}', 'metrics': {'reference_previews.baseline.pageviews': 'pageviews', 'reference_previews.baseline.footnote_clicks_per_pageview': 'footnote_clicks_per_pageview', 'reference_previews.baseline.content_clicks_per_pageview': 'content_clicks_per_pageview'}} results={'header': '[]', 'data': '0 rows'} group=None>"...
Mar 29 11:34:12 an-launcher1002 kerberos-run-command[5517]: SLF4J: Class path contains multiple SLF4J bindings.
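
The TypeError in the traceback above is consistent with the Hive query failing to parse and producing no output, so that the header passed to normalize_results is None. The following is only a minimal sketch of that failure mode, not the actual reportupdater code: the only line taken from the log is the empty_row expression, while read_header, empty_row_unguarded, and empty_row_guarded are hypothetical names made up for illustration.

```python
import csv
import io


def read_header(tsv_text):
    """Return the first row of TSV output, or None if there is no output.

    Hypothetical helper: it stands in for whatever reportupdater does to
    obtain `normalized_header` from the Hive output.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return next(reader, None)


def empty_row_unguarded(start, header):
    # Same expression as executor.py line 198 in the traceback.
    # Raises TypeError when header is None.
    return [start] + [None] * (len(header) - 1)


def empty_row_guarded(start, header):
    # Assumed alternative: fail with a descriptive error instead of a
    # bare TypeError when the query produced no header.
    if not header:
        raise ValueError("query returned no header; did the HQL fail to parse?")
    return [start] + [None] * (len(header) - 1)
```

With empty output, read_header("") returns None and empty_row_unguarded raises the same TypeError as in the log, whereas the guarded variant would point at the failed query. The underlying problem still looks like the ParseException at line 42 of the HQL, so a guard like this would only make the log easier to read.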

I suspect that we should follow up with the owner of the report to find out what to do about this.

What do you think, @JAllemandou? Should we just close this ticket about the CPU load on an-launcher1002?

We already have a task related to the migration of reportupdater jobs to Airflow: T307540: [Airflow Migration] Migrate reportupdater jobs (albeit not a tracking task).

In terms of the error from the reference-previews report, I guess that @awight would be the person to notify, given that he wrote it.

JAllemandou claimed this task.

We discussed this in standup: the CPU load on the server has dropped to an acceptable level, and we plan to tackle the reportupdater queries as part of the Airflow migration, as well as moving Airflow to its own instance. With this, we can mark this ticket as done.