Improve df_to_remarkup formatting for wmfdata-python
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	AndrewTavis_WMDE
	Jul 11 2023, 2:36 PM

Description

In working a bit with wmfdata-python I just found the function df_to_remarkup. I had already written another version of this function for myself without realizing it was in wmfdata/utils.py, and upon checking out the one within the package I'm thinking that we could do a better job of formatting the output by directly using pandas.DataFrame.to_markdown and doing a minor edit of the output.

Here's a basic example:

wmf.utils.df_to_remarkup(df_rest_api_http_status)

... results in:

| http_status | total_requests | percent_of_total
| ----- | ----- | ----- 
| 200 | 119941 | 96.5
| 404 | 2507 | 2.0
| 304 | 1231 | 1.0
| 308 | 307 | 0.2
| 429 | 227 | 0.2
| 400 | 126 | 0.1

What I'd written before finding df_to_remarkup is:

print(df_rest_api_http_status.to_markdown(index=False, tablefmt="pipe").replace(":", "-"))

... that results in:

|   http_status |   total_requests |   percent_of_total |
|---------------|------------------|--------------------|
|           200 |           119941 |               96.5 |
|           404 |             2507 |                2   |
|           304 |             1231 |                1   |
|           308 |              307 |                0.2 |
|           429 |              227 |                0.2 |
|           400 |              126 |                0.1 |

The .replace(":", "-") fixes the header orientation formatting from |--------------:|-----------------:|-------------------:|, with this incorrectly adding a row for the hyphens/bars/colons below the header in Remarkup and further does not format the header with a different background color. I like that 2.0 and 1.0 are maintained in df_to_remarkup, and would implement a change that allows for integer floats to maintain their decimal place. Specifically pandas is already loaded into wmfdata/utils.py, so we could potentially use the above code and a minor edit to maintain floats for the task's work. The resulting code would also be more concise.

Contribution

I'd work on this myself if it's deemed to be something that we'd want to implement. I'm seeing there are a couple of things for the documentation that I could also send along in a separate PR :) Obviously this isn't a major change as it's all getting copy-pasted for the same result, but looking at a nicely formatted markup table's always nicer than a poorly formatted one in my opinion 😇

Let me know if there's something else I should do for future wmfdata-python issues!

Related Objects

Mentioned In: T348999: Add linter and formatter to wmfdata-python (and link check)

Event Timeline

AndrewTavis_WMDE created this task.Jul 11 2023, 2:36 PM

Restricted Application added projects: Data-Engineering, Product-Analytics. · View Herald TranscriptJul 11 2023, 2:36 PM

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

AndrewTavis_WMDE updated the task description. (Show Details)Jul 11 2023, 2:39 PM

AndrewTavis_WMDE updated the task description. (Show Details)Jul 11 2023, 2:48 PM

AndrewTavis_WMDE updated the task description. (Show Details)

AndrewTavis_WMDE updated the task description. (Show Details)Jul 11 2023, 2:51 PM

AndrewTavis_WMDE updated the task description. (Show Details)Jul 11 2023, 3:59 PM

AndrewTavis_WMDE updated the task description. (Show Details)

Note that the edit to maintain the floating point number decimal places is achieved by passing tbe floatfmt kwarg to tabulate via pandas.DataFrame.to_markdown in the following way:

print(df_rest_api_http_status.to_markdown(index=False, tablefmt="pipe", floatfmt=".1f").replace(":", "-"))

... that results in:

|   http_status |   total_requests |   percent_of_total |
|---------------|------------------|--------------------|
|           200 |           119941 |               96.5 |
|           404 |             2507 |                2.0 |
|           304 |             1231 |                1.0 |
|           308 |              307 |                0.2 |
|           429 |              227 |                0.2 |
|           400 |              126 |                0.1 |

AndrewTavis_WMDE updated the task description. (Show Details)Jul 11 2023, 4:05 PM

AndrewTavis_WMDE updated the task description. (Show Details)Jul 11 2023, 4:07 PM

AndrewTavis_WMDE updated the task description. (Show Details)Jul 12 2023, 8:16 AM

The point value for floatfmt=".1f" would also be calculated to make sure that we match it to the maximum decimal place needed :)

Hi Andrew! Thank you so much for digging into this. I think this is a great plan, so please do create a pull request on GitHub (we should definitely migrate to GitLab but haven't found time yet).

I'd suggest passing any arguments other df straight to the Pandas function, so people can use any of its parameters (particularly index=True, which addresses a major limitation in the current code) as well as any parameters supported by Tabulate.

Great, @nshahquinn-wmf! Thanks for the direction on it :) Will try to get to this in the coming week and we can discuss in the PR 😊

mpopov moved this task from Triage to Tracking on the Product-Analytics board.Jul 24 2023, 5:26 PM

@nshahquinn-wmf, just FYI I do have this on my radar. Sorry it's taking so long... I'm in the process of waiting for a new computer and then I'll have my full VS Code setup up and running. I'll update you when I start to work on this :)

@AndrewTavis_WMDE no worries at all! It would be a nice improvement, but obviously, there's no urgency at all.

nshahquinn-wmf triaged this task as Low priority.Sep 1 2023, 1:30 AM

Hey there, @nshahquinn-wmf! One week on after starting the new computer setup and I've been using this as a means to test it all out 🚀 Has been fun! Here's where we're at:

Within utils.py we have:

def df_to_remarkup(df, **kwargs):
    """
    Prints a Pandas dataframe as a Remarkup table suitable for pasting into
    Phabricator.

    Note that among many kwargs the following are useful for editing the output:
        index (bool): passing `False` removes the dataframe index (default is `True`)
            Ex: `df_to_remarkup(df, index=False)`

        floatfmt (str): the decimal place to round the outputs to
            Ex: `df_to_remarkup(df, floatfmt=".1f")` to round to the first decimal place

    See the options for pandas.DataFrame.to_markdown and the Python package tabulate for other kwargs.
    """
    print(df.to_markdown(tablefmt="pipe", **kwargs).replace(":", "-"))

I removed the suggestion to use it with DataFrame.pipe as now we have kwargs to account for. There likely is a way to do it with pipe still, but I find using it as a standalone function is more intuitive now :)

Here's a dummy df for local testing that's the output I had for the task description:

db = pd.DataFrame(
    np.array(
        [
            [200, 119941, 96.5],
            [404, 2507, 2],
            [304, 1231, 1],
            [308, 307, 0.2],
            [429, 227, 0.2],
            [400, 126, 0.1]
        ]
    ),
    columns=["http_status", "total_requests", "percent_of_total"]
)

And then using the function:

df_to_remarkup(db)

# |    |   http_status |   total_requests |   percent_of_total |
# |----|---------------|------------------|--------------------|
# |  0 |           200 |           119941 |               96.5 |
# |  1 |           404 |             2507 |                2   |
# |  2 |           304 |             1231 |                1   |
# |  3 |           308 |              307 |                0.2 |
# |  4 |           429 |              227 |                0.2 |
# |  5 |           400 |              126 |                0.1 |

df_to_remarkup(db, index=False, floatfmt="0.1f")

# |   http_status |   total_requests |   percent_of_total |
# |---------------|------------------|--------------------|
# |         200.0 |         119941.0 |               96.5 |
# |         404.0 |           2507.0 |                2.0 |
# |         304.0 |           1231.0 |                1.0 |
# |         308.0 |            307.0 |                0.2 |
# |         429.0 |            227.0 |                0.2 |
# |         400.0 |            126.0 |                0.1 |

Along with the function there would be some other minor changed lines for indentation fixes in utils.py. I'm noticing we have two spaces instead of 4 in a few places :)

Let me know what you think on this! Please let me know if anything should be changed as far as the function doc string or anything else.

Hope you've been well 😊

AndrewTavis_WMDE added a project: Wikidata Analytics (Kanban).Oct 5 2023, 3:14 PM

AndrewTavis_WMDE moved this task from Incoming to To-Do on the Wikidata Analytics (Kanban) board.

Maintenance_bot added a project: Wikidata.Oct 5 2023, 3:30 PM

Just repeating what I shared on Slack for accessibility:

I think these changes are absolutely great and ready to merge! Please make a pull request and I or someone else can merge it ASAP 😁

AndrewTavis_WMDE moved this task from To-Do to In Progress on the Wikidata Analytics (Kanban) board.Oct 16 2023, 12:08 PM

Thanks, @nshahquinn-wmf! I've made wikimedia/wmfdata-python/pull/46 with the discussed changes :) :)

AndrewTavis_WMDE moved this task from In Progress to In Review on the Wikidata Analytics (Kanban) board.Oct 16 2023, 12:55 PM

AndrewTavis_WMDE mentioned this in T348999: Add linter and formatter to wmfdata-python (and link check).Oct 16 2023, 3:06 PM

@nshahquinn-wmf, resolving this as the PR was merged 😊 Thanks for the support here!

AndrewTavis_WMDE closed this task as Resolved.Oct 23 2023, 3:18 PM

Improve df_to_remarkup formatting for wmfdata-pythonClosed, ResolvedPublicActions

Description

Description

Contribution

Related Objects

Event Timeline

Improve df_to_remarkup formatting for wmfdata-python
Closed, ResolvedPublic
Actions