Page MenuHomePhabricator

Improve df_to_remarkup formatting for wmfdata-python
Closed, ResolvedPublic

Description

Description

In working a bit with wmfdata-python I just found the function df_to_remarkup. I had already written another version of this function for myself without realizing it was in wmfdata/utils.py, and upon checking out the one within the package I'm thinking that we could do a better job of formatting the output by directly using pandas.DataFrame.to_markdown and doing a minor edit of the output.

Here's a basic example:

wmf.utils.df_to_remarkup(df_rest_api_http_status)

... results in:

| http_status | total_requests | percent_of_total
| ----- | ----- | ----- 
| 200 | 119941 | 96.5
| 404 | 2507 | 2.0
| 304 | 1231 | 1.0
| 308 | 307 | 0.2
| 429 | 227 | 0.2
| 400 | 126 | 0.1

What I'd written before finding df_to_remarkup is:

print(df_rest_api_http_status.to_markdown(index=False, tablefmt="pipe").replace(":", "-"))

... that results in:

|   http_status |   total_requests |   percent_of_total |
|---------------|------------------|--------------------|
|           200 |           119941 |               96.5 |
|           404 |             2507 |                2   |
|           304 |             1231 |                1   |
|           308 |              307 |                0.2 |
|           429 |              227 |                0.2 |
|           400 |              126 |                0.1 |

The .replace(":", "-") fixes the header orientation formatting from |--------------:|-----------------:|-------------------:|, with this incorrectly adding a row for the hyphens/bars/colons below the header in Remarkup and further does not format the header with a different background color. I like that 2.0 and 1.0 are maintained in df_to_remarkup, and would implement a change that allows for integer floats to maintain their decimal place. Specifically pandas is already loaded into wmfdata/utils.py, so we could potentially use the above code and a minor edit to maintain floats for the task's work. The resulting code would also be more concise.

Contribution

I'd work on this myself if it's deemed to be something that we'd want to implement. I'm seeing there are a couple of things for the documentation that I could also send along in a separate PR :) Obviously this isn't a major change as it's all getting copy-pasted for the same result, but looking at a nicely formatted markup table's always nicer than a poorly formatted one in my opinion 😇

Let me know if there's something else I should do for future wmfdata-python issues!

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
AndrewTavis_WMDE updated the task description. (Show Details)
AndrewTavis_WMDE updated the task description. (Show Details)

Note that the edit to maintain the floating point number decimal places is achieved by passing tbe floatfmt kwarg to tabulate via pandas.DataFrame.to_markdown in the following way:

print(df_rest_api_http_status.to_markdown(index=False, tablefmt="pipe", floatfmt=".1f").replace(":", "-"))

... that results in:

|   http_status |   total_requests |   percent_of_total |
|---------------|------------------|--------------------|
|           200 |           119941 |               96.5 |
|           404 |             2507 |                2.0 |
|           304 |             1231 |                1.0 |
|           308 |              307 |                0.2 |
|           429 |              227 |                0.2 |
|           400 |              126 |                0.1 |

The point value for floatfmt=".1f" would also be calculated to make sure that we match it to the maximum decimal place needed :)

Hi Andrew! Thank you so much for digging into this. I think this is a great plan, so please do create a pull request on GitHub (we should definitely migrate to GitLab but haven't found time yet).

I'd suggest passing any arguments other df straight to the Pandas function, so people can use any of its parameters (particularly index=True, which addresses a major limitation in the current code) as well as any parameters supported by Tabulate.

Great, @nshahquinn-wmf! Thanks for the direction on it :) Will try to get to this in the coming week and we can discuss in the PR 😊

@nshahquinn-wmf, just FYI I do have this on my radar. Sorry it's taking so long... I'm in the process of waiting for a new computer and then I'll have my full VS Code setup up and running. I'll update you when I start to work on this :)

@AndrewTavis_WMDE no worries at all! It would be a nice improvement, but obviously, there's no urgency at all.

Hey there, @nshahquinn-wmf! One week on after starting the new computer setup and I've been using this as a means to test it all out 🚀 Has been fun! Here's where we're at:

  1. Within utils.py we have:
def df_to_remarkup(df, **kwargs):
    """
    Prints a Pandas dataframe as a Remarkup table suitable for pasting into
    Phabricator.

    Note that among many kwargs the following are useful for editing the output:
        index (bool): passing `False` removes the dataframe index (default is `True`)
            Ex: `df_to_remarkup(df, index=False)`

        floatfmt (str): the decimal place to round the outputs to
            Ex: `df_to_remarkup(df, floatfmt=".1f")` to round to the first decimal place

    See the options for pandas.DataFrame.to_markdown and the Python package tabulate for other kwargs.
    """
    print(df.to_markdown(tablefmt="pipe", **kwargs).replace(":", "-"))

I removed the suggestion to use it with DataFrame.pipe as now we have kwargs to account for. There likely is a way to do it with pipe still, but I find using it as a standalone function is more intuitive now :)

  1. Here's a dummy df for local testing that's the output I had for the task description:
db = pd.DataFrame(
    np.array(
        [
            [200, 119941, 96.5],
            [404, 2507, 2],
            [304, 1231, 1],
            [308, 307, 0.2],
            [429, 227, 0.2],
            [400, 126, 0.1]
        ]
    ),
    columns=["http_status", "total_requests", "percent_of_total"]
)
  1. And then using the function:
df_to_remarkup(db)

# |    |   http_status |   total_requests |   percent_of_total |
# |----|---------------|------------------|--------------------|
# |  0 |           200 |           119941 |               96.5 |
# |  1 |           404 |             2507 |                2   |
# |  2 |           304 |             1231 |                1   |
# |  3 |           308 |              307 |                0.2 |
# |  4 |           429 |              227 |                0.2 |
# |  5 |           400 |              126 |                0.1 |

df_to_remarkup(db, index=False, floatfmt="0.1f")

# |   http_status |   total_requests |   percent_of_total |
# |---------------|------------------|--------------------|
# |         200.0 |         119941.0 |               96.5 |
# |         404.0 |           2507.0 |                2.0 |
# |         304.0 |           1231.0 |                1.0 |
# |         308.0 |            307.0 |                0.2 |
# |         429.0 |            227.0 |                0.2 |
# |         400.0 |            126.0 |                0.1 |

Along with the function there would be some other minor changed lines for indentation fixes in utils.py. I'm noticing we have two spaces instead of 4 in a few places :)

Let me know what you think on this! Please let me know if anything should be changed as far as the function doc string or anything else.

Hope you've been well 😊

Just repeating what I shared on Slack for accessibility:

I think these changes are absolutely great and ready to merge! Please make a pull request and I or someone else can merge it ASAP 😁

@nshahquinn-wmf, resolving this as the PR was merged 😊 Thanks for the support here!