When a request is sent to the TTS server with Arabic as the language, the tokens in the response don't contain the input strings. Instead, as far as I can tell, they contain vocalized versions of the input, i.e. with vowel diacritics added. The exact input strings are needed to match the tokens to the text.
This affects at least word highlighting (T168948) and reciting selected text (T133680).
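
To illustrate the mismatch and one possible workaround, here is a minimal Python sketch that strips the vowel marks from each returned token before looking it up in the original text. It rests on an unconfirmed assumption: that the server only adds diacritics (U+064B to U+0652, plus U+0670) and leaves the base letters unchanged. The names `strip_harakat` and `match_tokens` are hypothetical and not part of any existing code.

```python
import re

# Arabic vowel marks (fathatan through sukun) plus superscript alef.
# Assumption: the server only adds these marks and never alters base letters.
HARAKAT = re.compile("[\u064B-\u0652\u0670]")


def strip_harakat(text):
    """Remove Arabic vowel diacritics, leaving the base letters."""
    return HARAKAT.sub("", text)


def match_tokens(input_text, tokens):
    """Return (start, end) offsets in input_text for each token.

    A token that cannot be found after stripping diacritics yields None.
    """
    offsets = []
    cursor = 0  # search forward only, so repeated words map in order
    for token in tokens:
        bare = strip_harakat(token)
        start = input_text.find(bare, cursor)
        if start == -1:
            offsets.append(None)
            continue
        end = start + len(bare)
        offsets.append((start, end))
        cursor = end
    return offsets


# Unvocalized input: "كتب كتاب"
text = "\u0643\u062A\u0628 \u0643\u062A\u0627\u0628"
# Vocalized tokens, as the server might return them
tokens = [
    "\u0643\u064E\u062A\u064E\u0628\u064E",
    "\u0643\u0650\u062A\u064E\u0627\u0628",
]
offsets = match_tokens(text, tokens)
```

This would not work if the input itself is partly vocalized, or if the server changes more than the diacritics. In those cases the server would need to return the original strings alongside the tokens.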