The revscoring library has a lot of duplicated code for similar features, e.g. between features.revision and features.parent_revision.
We also have some features hard-coded for enwiki (e.g. features.revision.image_links and features.revision.cite_links).
These should be re-implemented as meta-features that require parameterization.
E.g. in datasources/meta/wikitext_parsing.py:
    from ..datasource import Datasource


    class WikiTextParseTree(Datasource):

        def __init__(self, text_datasource, name=None):
            ...

        def process(self, text):
            ...
In datasources/revision.py:
    from .datasource import Datasource
    from .meta import WikiTextParseTree

    text = Datasource("revision.text")
    parse_tree = WikiTextParseTree(text, name="revision.parse_tree")
In datasources/parent_revision.py:
    from .datasource import Datasource
    from .meta import WikiTextParseTree

    text = Datasource("parent_revision.text")
    parse_tree = WikiTextParseTree(text, name="parent_revision.parse_tree")
There is a large set of features that follow this pattern and that we can clean up the same way.
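To make the idea concrete for the enwiki-specific features too, here is a rough, self-contained sketch of a parameterized meta-datasource. It does not use revscoring itself: the `Datasource` class and the `solve` helper below are simplified stand-ins, and `RegexMatchCount` and the `IMAGE_LINK` pattern are hypothetical names chosen only for illustration. The real dependency handling in the library may well look different.

```python
import re


class Datasource:
    """Simplified stand-in (an assumption, not revscoring's real class):
    a named value that is either supplied directly or computed from others."""

    def __init__(self, name, process=None, depends_on=()):
        self.name = name
        self.process = process
        self.depends_on = tuple(depends_on)


def solve(datasource, cache):
    """Resolve a datasource; `cache` holds raw inputs and memoized results."""
    if datasource.name not in cache:
        if datasource.process is None:
            raise KeyError("no value supplied for " + datasource.name)
        args = [solve(dep, cache) for dep in datasource.depends_on]
        cache[datasource.name] = datasource.process(*args)
    return cache[datasource.name]


class RegexMatchCount(Datasource):
    """Hypothetical meta-datasource: counts matches of a caller-supplied
    pattern in whatever text datasource it is given."""

    def __init__(self, text_datasource, pattern, name):
        self.regex = re.compile(pattern)
        super().__init__(name, self._count, depends_on=[text_datasource])

    def _count(self, text):
        return len(self.regex.findall(text))


# Illustrative pattern only; a per-wiki configuration would supply its own.
IMAGE_LINK = r"\[\[(?:File|Image):"

revision_text = Datasource("revision.text")
parent_text = Datasource("parent_revision.text")

# One class, two instances: no duplicated logic between revision and parent.
revision_image_links = RegexMatchCount(
    revision_text, IMAGE_LINK, name="revision.image_links")
parent_image_links = RegexMatchCount(
    parent_text, IMAGE_LINK, name="parent_revision.image_links")

cache = {
    "revision.text": "[[File:A.png]] some text [[Image:B.jpg]]",
    "parent_revision.text": "[[File:A.png]] some text",
}
revision_count = solve(revision_image_links, cache)
parent_count = solve(parent_image_links, cache)
```

The point of the sketch is that the wiki-specific part (the pattern) and the revision-specific part (the text datasource and name) both become constructor arguments, so the counting logic is written once.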