an-launcher1002 hosts Airflow, Sqoop, and some other timer jobs. Its CPU usage seems much higher than we would expect:
Related Objects
- Mentioned Here
- T307540: [Airflow Migration] Migrate reportupdater jobs
Event Timeline
I believe that the system load on an-launcher1002 is much lower than it was when this ticket was created.
Therefore, I would argue that we should perhaps close the ticket.
However, I just checked the graphs linked above and correlated them with the htop output.
One job stood out as particularly heavy:
Checking the comment there, we find that it is mentioned in two reportupdater jobs.
btullis@an-launcher1002:/srv/reportupdater/jobs$ grep -RI Extension:Cite
reportupdater-queries/reference-previews/cite.hql:-- Estimate the relative frequency of user interactions with Extension:Cite
reportupdater-queries/reference-previews/baseline.hql:-- Estimate the relative frequency of user interactions with Extension:Cite
So it would appear that the reportupdater-reference-previews service is at least one of the heavy components on an-launcher1002.
This service is also using the hive CLI to do its work, which we wish to deprecate with the bullseye upgrade.
Finally, I checked the logs for this service and discovered that there is an error from one part of the report.
btullis@an-launcher1002:/lib/systemd/system$ journalctl -u reportupdater-reference-previews.service -f
-- Logs begin at Tue 2023-03-28 22:08:36 UTC. --
Mar 29 11:34:12 an-launcher1002 kerberos-run-command[5517]: FAILED: ParseException line 42:0 cannot recognize input near 'event_counts' ';' '<EOF>' in joinSource
Mar 29 11:34:12 an-launcher1002 kerberos-run-command[5517]: 2023-03-29 07:36:26,986 - ERROR - Report "baseline" could not be executed because of error: object of type 'NoneType' has no len()
Mar 29 11:34:12 an-launcher1002 kerberos-run-command[5517]: Traceback (most recent call last):
Mar 29 11:34:12 an-launcher1002 kerberos-run-command[5517]:   File "/srv/reportupdater/reportupdater/reportupdater/executor.py", line 134, in execute_hive
Mar 29 11:34:12 an-launcher1002 kerberos-run-command[5517]:     report.results = self.normalize_results(report, None, tsv_reader)
Mar 29 11:34:12 an-launcher1002 kerberos-run-command[5517]:   File "/srv/reportupdater/reportupdater/reportupdater/executor.py", line 198, in normalize_results
Mar 29 11:34:12 an-launcher1002 kerberos-run-command[5517]:     empty_row = [report.start] + [None] * (len(normalized_header) - 1)
Mar 29 11:34:12 an-launcher1002 kerberos-run-command[5517]: TypeError: object of type 'NoneType' has no len()
Mar 29 11:34:12 an-launcher1002 kerberos-run-command[5517]: 2023-03-29 07:36:26,987 - INFO - Executing "<Report key=baseline type=hive granularity=days lag=10800 first_date=2019-11-02 start=2022-03-30 end=2022-03-31 db_key=None hql_template=-- -- Estimate the relative frequency of user interactions with Extension:Cite -- references: -- * H...
sql_template=None script=None explode_by={} max_data_points=None graphite={'path': '{_metric}.{wiki}', 'metrics': {'reference_previews.baseline.pageviews': 'pageviews', 'reference_previews.baseline.footnote_clicks_per_pageview': 'footnote_clicks_per_pageview', 'reference_previews.baseline.content_clicks_per_pageview': 'content_clicks_per_pageview'}} results={'header': '[]', 'data': '0 rows'} group=None>"...
Mar 29 11:34:12 an-launcher1002 kerberos-run-command[5517]: SLF4J: Class path contains multiple SLF4J bindings.
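For context, the TypeError in the traceback looks like a symptom rather than the root cause: the Hive ParseException comes first, and the header that reaches normalize_results then appears to be None. The sketch below only illustrates that failure mode. The expression in `unguarded_empty_row` is copied from the traceback, but both function names and the guard are hypothetical and are not taken from the reportupdater code base.

```python
from typing import Any, List, Optional, Sequence


def unguarded_empty_row(start: Any, normalized_header: Optional[Sequence[str]]) -> List[Any]:
    # Same expression as in the traceback; raises TypeError when the header is None.
    return [start] + [None] * (len(normalized_header) - 1)


def guarded_empty_row(start: Any, normalized_header: Optional[Sequence[str]]) -> List[Any]:
    # Hypothetical guard (an assumption, not the project's actual fix):
    # fail with a message that points at the query instead of at len().
    if normalized_header is None:
        raise ValueError("query returned no header; check the query output for errors")
    return [start] + [None] * (len(normalized_header) - 1)


if __name__ == "__main__":
    # A well-formed header yields one start value plus padding.
    print(guarded_empty_row("2022-03-30", ["date", "pageviews", "clicks"]))
    try:
        unguarded_empty_row("2022-03-30", None)
    except TypeError as exc:
        print("unguarded:", exc)
    try:
        guarded_empty_row("2022-03-30", None)
    except ValueError as exc:
        print("guarded:", exc)
```

A guard like this would not fix the report, since the query at line 42 of the HQL still needs correcting, but it would make the log point at the failed query rather than at a secondary error.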
I suspect that we should follow up with the owner of the report to find out what to do about this.
What do you think, @JAllemandou? Should we just close this ticket about the CPU load on an-launcher1002?
We already have a task related to the migration of reportupdater jobs to Airflow: T307540: [Airflow Migration] Migrate reportupdater jobs (albeit not a tracking task).
In terms of the error from the reference-previews report, I guess that @awight would be the person to notify, given that he wrote it.
We discussed this in standup: the CPU load of the server has dropped to an acceptable level, and we plan to tackle the reportupdater queries as part of the Airflow migration, as well as moving Airflow to its own instance. With this, we can mark this ticket as done.