
Outreachy proposal for Technology to transclude git content into wiki pages
Closed, Declined · Public

Description

Profile Information

Name: Akanksha Gupta
Preferred pronoun: she
Email: akanksha2879@gmail.com
irc nick: akangupt
Mediawiki User: Akangupt
Location: Mandi, Himachal Pradesh, India
Time Zone: UTC+5:30
Typical working hours (Indian Standard Time):
5 PM to 1 AM - before 7th December 2015
10 AM to 6 PM - 7th December 2015 to 21st February 2016
5 PM to 1 AM - after 21st February 2016

Synopsis

It is convenient to read documentation and code snippets directly on wiki pages rather than searching for them in git repositories. To make this possible, developers end up copying git-hosted docs onto the wiki.
These wiki copies of git-docs frequently end up outdated or wrong. This is unsurprising, because developers are not required to keep track of every commit or merge in the corresponding repositories and then update the wiki pages accordingly.

Solution:

To avoid outdated or incorrect documentation, and to make it easy to insert code snippets from git repositories into the wiki, the proposal is to create an extension that transcludes git content into wiki pages and updates it when the corresponding git-docs are updated.
The extension will also be able to access the data in extension.json (required) and hooks.yaml (optional). This data will be fed to the wiki templates that create the extension infobox and the hook pages, respectively.

Possible mentors

  1. S Page (@Spage)
  2. Gergő Tisza (@Tgr)
  3. Ankita Shukla (@Ankitashukla)

The plan

Step-1:

Create an extension which can transclude content from git into wiki pages. This involves:

  • Creating the parser function '#GitAtWiki'.

    The git server can be requested for:
    • A file
    • Part of a file, by specifying a regex
    • A code snippet in a file

The parser function will have a syntax like:

{{#GitAtWiki: FILENAME|repo=REPOSITORY|branch=BRANCHNAME|from=STARTPATTERN|to=ENDPATTERN|code=yes}}

Parameters for the parser function:

    • filename: Required. The file whose content is to be displayed. For example: {{#GitAtWiki: docs/hooks.txt}}
    • repo: Optional. Repository name. Default: the repository set via the `$egGitAtWikiDefaultRepo` setting in LocalSettings.php. For example, for MediaWiki core: $egGitAtWikiDefaultRepo = 'wikimedia/mediawiki';
    • branch: Optional. The branch in which to look for the file. Default: master.
    • from: Optional. The start regex pattern from which content will be pulled. Default: start of file.
    • to: Optional. The end regex pattern up to which content will be pulled. Default: end of file.
    • code: Optional. Specify code=yes to highlight the fetched content as code. Default: no.
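A minimal registration sketch for the proposed parser function, assuming the extension is named GitAtWiki; the class, hook and helper names below (including GitAtWiki::fetchAndFormat) are illustrative assumptions, and the magic-word definition file is omitted:

<?php
// GitAtWiki.hooks.php - illustrative sketch only; names are assumptions.

class GitAtWikiHooks {
	/**
	 * Register the #GitAtWiki parser function.
	 * Hooked via $wgHooks['ParserFirstCallInit'][] = 'GitAtWikiHooks::onParserFirstCallInit';
	 */
	public static function onParserFirstCallInit( Parser $parser ) {
		$parser->setFunctionHook( 'GitAtWiki', 'GitAtWikiHooks::render' );
		return true;
	}

	/**
	 * {{#GitAtWiki: filename | repo=... | branch=... | from=... | to=... | code=yes}}
	 */
	public static function render( Parser $parser, $filename = '' /* , named params follow */ ) {
		// Collect the name=value parameters after the filename.
		$args = array_slice( func_get_args(), 2 );
		$options = array();
		foreach ( $args as $arg ) {
			$parts = explode( '=', $arg, 2 );
			if ( count( $parts ) === 2 ) {
				$options[trim( $parts[0] )] = trim( $parts[1] );
			}
		}
		// Fetching and rendering are sketched further below; fetchAndFormat() is hypothetical.
		$wikitext = GitAtWiki::fetchAndFormat( $filename, $options );
		return array( $wikitext, 'noparse' => false );
	}
}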
  • Requesting the file over HTTP (or via file_get_contents) and caching the response with an expiration time (approximately 3 days), using the source URL as the cache key. The raw-content URL used to fetch the files can be modified with the `$egGitAtWikiDefaultUrl` setting in LocalSettings.php. By default it will be set for GitHub:

$egGitAtWikiDefaultUrl = 'https://raw.githubusercontent.com';

$sourceUrl = "$egGitAtWikiDefaultUrl/$repo/$branch/$filename";
$fetchedValue = \Http::get( $sourceUrl );

To transclude a file from GitLab, Phabricator Diffusion, git.wikimedia.org or gerrit.wikimedia.org, $egGitAtWikiDefaultUrl (set by the user) and the $sourceUrl construction (in the code) will be adjusted accordingly.
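A sketch of how the fetch and the ~3-day cache could fit together, in 2015-era MediaWiki style using $wgMemc and wfMemcKey; this is not the final implementation, and buildSourceUrl() is a hypothetical helper that assembles the raw-content URL shown above:

<?php
// Illustrative caching sketch; names are assumptions.

class GitAtWiki {
	const CACHE_TTL = 259200; // ~3 days, as proposed above

	public static function fetchCached( $repo, $branch, $filename ) {
		global $wgMemc;

		$sourceUrl = self::buildSourceUrl( $repo, $branch, $filename ); // hypothetical helper
		$key = wfMemcKey( 'gitatwiki', md5( $sourceUrl ) );

		$content = $wgMemc->get( $key );
		if ( $content !== false ) {
			return $content; // cache hit
		}

		$content = \Http::get( $sourceUrl );
		if ( $content !== false ) {
			// Only cache successful fetches, so a transient error does not stick for 3 days.
			$wgMemc->set( $key, $content, self::CACHE_TTL );
		}
		return $content;
	}
}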

  • Making the received content renderable after sanitizing it.

File Format | Rendering Method
Markdown files (e.g. README.md) | Converted to HTML using Michelf\Markdown or a similar library.
MediaWiki and wiki files (e.g. README.mediawiki, lua.wiki) | Passed through as wikitext and rendered by the normal parser.
Text files, etc. (e.g. README.txt) | Rendered using <poem> and <nowiki> tags.

For MediaWiki files and wiki files (based on the file extension), and for code snippets (if the code parameter is yes):

// 'noparse' => false so that the returned wikitext is rendered by the normal parser.
return array( $fileContents, 'nowiki' => false, 'noparse' => false, 'isHTML' => false );

For text files :

$output = '<poem>' . htmlspecialchars( $fileContents ) . '</poem>';
return array( $output, 'nowiki' => true, 'noparse' => true, 'isHTML' => true );
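A sketch of how the extension might pick a rendering method from the file extension, and how the from/to regex parameters could cut out part of a file; the class and function names are illustrative, and michelf/php-markdown is assumed to be installed via Composer:

<?php
// Illustrative dispatch sketch; names are assumptions.

class GitAtWikiFormatter {
	/** Cut the content down to the part between the from/to regex patterns, if given. */
	public static function extractRange( $content, $fromPattern = null, $toPattern = null ) {
		$lines = explode( "\n", $content );
		$start = 0;
		$end = count( $lines ) - 1;
		foreach ( $lines as $i => $line ) {
			if ( $fromPattern !== null && preg_match( $fromPattern, $line ) ) {
				$start = $i;
				$fromPattern = null; // only match the first occurrence
			}
			if ( $fromPattern === null && $toPattern !== null && preg_match( $toPattern, $line ) ) {
				$end = $i;
				break;
			}
		}
		return implode( "\n", array_slice( $lines, $start, $end - $start + 1 ) );
	}

	/** Pick a rendering strategy from the file extension, as in the table above. */
	public static function format( $filename, $content, $highlightAsCode = false ) {
		$ext = strtolower( pathinfo( $filename, PATHINFO_EXTENSION ) );

		if ( $ext === 'md' || $ext === 'markdown' ) {
			// Markdown -> HTML (sanitization of the resulting HTML is omitted here).
			$html = \Michelf\Markdown::defaultTransform( $content );
			return array( $html, 'isHTML' => true );
		}

		if ( $ext === 'mediawiki' || $ext === 'wiki' || $highlightAsCode ) {
			// Hand the content back to the normal wikitext parser.
			return array( $content, 'noparse' => false, 'isHTML' => false );
		}

		// Plain text: escape and preserve line breaks, as in the snippet above.
		$output = '<poem>' . htmlspecialchars( $content ) . '</poem>';
		return array( $output, 'nowiki' => true, 'noparse' => true, 'isHTML' => true );
	}
}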

Step-2:

Updating the wiki pages when the git-docs get updated.
The idea is to maintain a mapping/database of source (git-doc) and destination (wiki page), and to add a post-commit hook that triggers the update remotely from Jenkins.
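As a rough sketch of what the Jenkins-triggered update could do, assuming the mapping yields a list of affected wiki page titles: the job could simply purge those pages through the MediaWiki API (api.php action=purge). The API URL and page titles below are placeholders, and a real version would handle authentication and errors:

<?php
// Illustrative update trigger; the API URL and titles are placeholders.

function purgeWikiPages( array $titles, $apiUrl = 'https://www.mediawiki.org/w/api.php' ) {
	$postData = http_build_query( array(
		'action' => 'purge',
		'format' => 'json',
		'forcelinkupdate' => 1,
		'titles' => implode( '|', $titles ),
	) );

	$context = stream_context_create( array(
		'http' => array(
			'method' => 'POST',
			'header' => "Content-Type: application/x-www-form-urlencoded\r\n",
			'content' => $postData,
		),
	) );

	// A production version would authenticate and check the response for errors.
	return file_get_contents( $apiUrl, false, $context );
}

// Example: pages mapped to docs/hooks.txt in the (hypothetical) source-to-page mapping.
purgeWikiPages( array( 'Manual:Hooks', 'Extension:GitAtWiki/docs' ) );

Whether Jenkins runs a script like this or a maintenance script on the wiki side is an open design question.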

Step-3:

Write a PHP script to process the text fetched by the extension. The extension will fetch the text from hooks.yaml and return the information about a specific hook with the help of PHP and Lua. Hence, this extension can be used to create the individual hook pages easily.
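A hedged sketch of the PHP side of this step, assuming hooks.yaml maps hook names to fields such as a description and arguments (the exact schema is not fixed yet) and that a YAML parser such as Symfony's Yaml component is available via Composer:

<?php
// Illustrative sketch; the hooks.yaml schema and field names are assumptions.

use Symfony\Component\Yaml\Yaml;

/**
 * Return the entry for one hook from hooks.yaml, or null if it is not listed.
 * $yamlText is the file content already fetched by the extension (Step 1).
 */
function getHookInfo( $yamlText, $hookName ) {
	$hooks = Yaml::parse( $yamlText );
	if ( !is_array( $hooks ) || !isset( $hooks[$hookName] ) ) {
		return null;
	}
	return $hooks[$hookName]; // e.g. array( 'description' => ..., 'args' => ... )
}

// A thin Lua/Scribunto wrapper could then expose getHookInfo() to wiki modules,
// so the template that builds individual hook pages can pull the data directly.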

Phases

Phase I: Request a repository and a Wikimedia Labs instance, and explore existing parser functions and their coding conventions.
Phase II: Create the extension covering the use cases where the git server is requested for a file or a code snippet.
Phase III: Extend the extension to cover the use case where the git server is requested for part of a file by specifying regex patterns.
Phase IV: Create a bot to update the wiki-pages. (step-2)
Phase V: Write a mechanism to process the text. (step-3)
Phase VI: Deploy and test the final extension and document the extension.

Deliverables

  1. Required. The extension which satisfies the use cases when the git server is requested for a file or a code snippet. (phase II)
  2. Required. A feature in the extension for the use case where the git server is requested for part of a file by specifying regex patterns. (phase III)
  3. Required. A bot to update the wiki-pages. (phase IV)
  4. Optional. A mechanism to process the text given by the extension. (phase V)
  5. Required. Documentation for the GitAtWiki extension. (Phase VI)

Timeline

Period | Task
Before Dec 7 | Request a repository, get a Wikimedia Labs instance and explore existing parser functions and their coding conventions. (Phase I)
Dec 7 - Dec 10 | Create the basic setup/files for the extension, such as internationalisation.
Dec 11 - Dec 28 | Write the parser function that satisfies the use cases where the git server is requested for a file or a code snippet. (Phase II)
Dec 29 - Jan 6 | Write the unit tests and test the code.
Jan 7 - Jan 16 | Get the security review and code review, address the issues/bugs and deploy the extension (the MVP). Start the documentation for the extension. Create the feature for the use case where the git server is requested for part of a file by specifying regex patterns. (Phase III)
Jan 17 - Jan 22 | Write the unit tests and test the code.
Jan 23 - Jan 30 | Send the code for review and address the issues/bugs. Create the basic setup for the mechanism that updates the wiki pages.
Jan 31 - Feb 17 | Write the script that implements the wiki-page update mechanism. (Phase IV)
Feb 18 - Feb 26 | Write the unit tests and test the code.
Feb 27 - March 7 | Send the code for review, fix the bugs, deploy the extension and write the final documentation. (Phase VI)

NOTE: The Lua module (Phase V) is a stretch goal; I will attempt it if the complexity of the work and the remaining time permit.

Additional Information About The Project

Existing extensions :

Extension | Pitfalls
Git2Pages | Executes git clone and other commands to create local files, which takes more space. It does not have an option to allow only a whitelist of sites, which can create a security hazard.
Gitweb and GitHub | Send an HTTP request to the git server but do not satisfy the use cases where the git server is requested for: (1) part of a file by specifying start and end line numbers, (2) a code snippet in a file.

After the project :

A bot, which can fetch all the information about a hook with the help of the Lua module created in this project, can be built to create the missing hook pages. (T93043)

Participation

  • The progress report of the project will be updated weekly on this page.
  • The source code will be published on Gerrit.
  • I will ask for help and clear doubts on the MediaWiki-General, #wikimedia-dev and #wikimedia-tech IRC channels, as I have already been doing.
  • To get opinions and feedback, I will use the wikitech-l mailing list.

About Me

Education: I'm an undergraduate student of computer science at IIT Mandi, India.

How did you hear about the program?
From my friends who have already done an outreachy internship.

Do you meet the eligibility requirements outlined at https://wiki.gnome.org/Outreachy#Eligibility (if no, explain why not)?
Yes, I am eligible.

Will you have any other time commitments, such as school work, another job, planned vacation, etc., during the duration of the program?
1st December 2015 - 21st February 2016 : Winter vacations. I will not have any other commitments.
22nd February 2016 - 7th March 2016: College will take 20 hours per week. (These will also be the first 15 days of the semester, so the schedule will be very light, as this time is meant for examining and choosing courses.)

I will dedicate at least 42 hours per week during my internship.

We advise all candidates eligible to Google Summer of Code and FOSS Outreach Program for Women to apply for both programs. Are you planning to apply to both programs and, if so, with what organization(s)?
Only OPW, for Wikimedia

What makes you want to make this the most awesomest wiki enhancement ever?
Seeing the ever-growing community of developers and contributors around the wikis motivates me. The wikis have become a platform that trains new generations of developers through a huge knowledge base, but I ran into a problem while exploring them to learn: when the same information is available in many places, it is best to keep the data in one place, or at least in a single authoritative copy, to minimise inconsistencies between the various sources. Currently no extension provides the functionality to automatically update such pages across wikis, or at least to inform the reader that the content has been revised or deprecated. I think this project can solve that problem, and its effect will be long-lasting and far-reaching.

Why did you choose wikimedia?
Wikimedia was the source of my initial knowledge about the internet and my first step into the online world, and since then there has not been a day when I don't visit Wikipedia or some other wiki for one kind of information or another. It marks a new era of learning through the internet; the wikis are a major contributor to that learning process and in some sense define the present-day internet. I have always admired the elegance and, at the same time, the simplicity of the idea, and its contributors for realising it. Contributing to the growth of the internet and to the world's pool of knowledge has always been on my wishlist, and through this project that might just come true: making the internet a little better than it was the day before.

Past Experience

I have done many coding projects during my undergraduate studies and implemented various algorithms and data structures. As part of my industrial internship, I worked at Amazon India during the summer of 2015.

Academic Projects :

  • Remote Directory Editor: Multiple clients can connect to the server through a simple shell interface and use features such as removing, renaming and adding files, listing files, and moving or changing the mode of the working directory.
  • Chat-room: A user signs in and can see online users; add, remove, block or unblock them in a local address book; change their status (online/busy); see their statistics; use admin features; configure the account; and transfer messages and files. (Using PHP, MySQL, HTML, CSS, JavaScript)
  • Web-email: A user signs up, with additional features such as a user profile and a contact address list. Users can send mail to all domain email addresses and receive mail from same-domain addresses. (Using hMailServer, PHP, MySQL, HTML, CSS, JavaScript)
  • A Fiery Vengeance: A Japanese role-playing game (JRPG) with features such as saved games, an inventory, a health interface, dynamic and turn-based fights, and other mini-games. (Using Python, Tiled and the open-source library pygame)
  • Voice Guided Dehumidifier: A voice-commanded system that dehumidifies a room with the help of a desiccant material such as silica gel.

Microtask completed : T115388

Event Timeline

Akangupt claimed this task.
Akangupt raised the priority of this task from to Medium.
Akangupt updated the task description. (Show Details)
Akangupt added subscribers: Ankitashukla, Hansika11, Devirk and 18 others.
Akangupt set Security to None.

We are approaching the Outreachy'11 application deadline, and if you want your proposal to be considered for this round, do sign up and add your proposal at https://outreachy.gnome.org/ before November 02 2015, 07:00 pm UTC. You can copy-paste the above proposal into the Outreachy application system and keep polishing it over here. Keep in mind that your mentors and the organization team will be evaluating your proposal here in Phabricator, and you are free to ask for and get more reviews, following https://www.mediawiki.org/wiki/Outreach_programs/Life_of_a_successful_project#Answering_your_questions

Congrats on writing a proposal!

Quick comment on tag syntax: the commitID need not be a requirement, it should default to HEAD. Likewise, repo can be optional and default to MediaWiki core. Also, the repo should not be a URL but a project specifier within the repository host. Take a look at mediawiki.org's Template:Git_file for intelligent defaults; this thing should be like that template, except that it actually transcludes the file contents rather than linking to it. It's a good question whether editors should be able to say "transclude this file from a GitHub/GitLab/random repo" (instead of from gerrit.wikimedia.org or phabricator.wikimedia.org/diffusion) in the parser tag, or whether that should only be a config option; see Extension:GitHub's $egSomethingUrl variable.

Likewise, repo can be optional and default to MediaWiki core.

I don't think it should. Extensions should not hardcode Wikimedia usecases. For convenient defaults, you can always use templates.

Profile:

  • Please clarify the time zone for the hours in the profile.

Synopsis:

  • It might be worth mentioning the structured data use case (e.g. pulling extension.json and providing the data therein to the wiki template that creates the extension infobox) since it might take some dedicated functionality to do that. Of course if you decide that does not fit into the roadmap that's fine.

Step 1:

  • You should give some more thought to how the repo parameter will work. You can use a git URL to specify the repo; every git repo has one. There is no straightforward way to turn that into an HTTP request though. You can use git to get the content, but that's not trivial, so you should give some thought to how exactly you will do that if you take that route. Alternatively, you can use an HTTP request to fetch the file, but there is no uniform way of putting a base URL + branch name + file name together into a full URL. (Have a look at how GitHub and git.wikimedia.org do it. You could check some other large git hosts, too.) Also, if you go the HTTP route you might end up with something that's not specific to git at all.
  • there is very little difference between commit IDs and branch names in git, they are both ways to refer to a specific commit. There are other ways (tags are the most important ones, but see this explanation of a commitish for the general case). No point in having separate parameters for those.
  • startline/endline is not that useful (compared to e.g. a regex). It might be handy in a few cases, but there should be better ways. (Again if you think those alternatives do not fit into your schedule that's fine, but should be mentioned explicitly.) I see that's mentioned later in some places, but it's a bit confusing.
  • it's unclear what the code parameter does.
  • if you request a file over HTTP, how do you determine the commit ID (to use as a caching key)? Do you require wiki editors to always provide one?
  • how does the extension choose which formatting to use? Guess based on file extension?
  • "Mediawiki files and Wiki files ... These files will be rendered as it is." - you probably mean their rendering will be left to the normal parser (ie. the extension will just pass them as wikitext).
  • return array( "<syntaxhighlight>$text</syntaxhighlight>", 'nowiki' => true, 'noparse' => true, 'isHTML' => true ); will result in a <syntaxhighlight> tag in the HTML code (which does not do anything). Invoking another tag extension is a bit more troublesome than that, and IMO there is no reason to do it at all. Whoever invokes your extension can always do <syntaxhighlight>{{#GitAtWiki:...}}</syntaxhighlight> on their side (or something like that; the actual syntax is a bit more complicated).

Step 2:

  • As said above, getting the commit id is not trivial. If the editor specifies something like "master", a HTTP request won't give it to you. If the editor specifies a commit id, there is no need for cache invalidation at all.
  • Not sure you should bother with line numbers in the cache - if the file changes, just update. The wikitext parser has its own cache; if the text provided by your parser function is the same, the update will be cheap.
  • It's a bit unclear what you expect the git hook to do. (Looking at the GitHub docs might help give you an idea of how they work.)

Step 3:

  • You probably want to do the processing in PHP and just have a lua module for interfacing with it. Doing things in Lua is hard. Look at e.g. the Site lua library - it's just a bunch of wrappers for PHP functionality.
  • You can also make the whole Lua thing into a stretch goal that you attempt if you progress faster than planned but don't put it on the schedule.

Deliverables/Timeline:

  • I don't think it's realistic to split "write code for feature X" and "test it / get it reviewed". Generally the process is smoother when you write code in small steps and publish them as soon as possible ("commit early, commit often") and then work on the next small patch while waiting for reviews.
  • An MVP is not an MVP unless it can be deployed right there and then (and we should deploy it right then, at least to the beta site). You might want to set apart some time for that after reaching the MVP stage (getting a security review, setting up the configuration etc - see Writing an extension for deployment).

Participation:

  • it's a good idea to commit to a specific update schedule (say, weekly reports).

@Tgr :

Please clarify the time zone for the hours in the profile.

I guess you mean this :

Time Zone: UTC+5:30
Typical working hours: (Indian Standard Time)

right?

It might be worth mentioning the structured data use case (e.g. pulling extension.json and providing the data therein to the wiki template that creates the extension infobox) since it might take some dedicated functionality to do that. Of course if you decide that does not fit into the roadmap that's fine.

What about composer.json for the case specific in which extension.json doesn't exist and for the cases where extension.json and composer.json both exist.

It's a bit unclear what you expect the git hook to do. (Looking at the GitHub docs might help give you an idea of how they work.)

I want to add a post-commit hook which triggers the update remotely from Jenkins. I know the Jenkins-fu might get complicated, but for now I can't think of a better way to trigger the update. Comments?

how does the extension choose which formatting to use? Guess based on file extension?

yes.

"Mediawiki files and Wiki files ... These files will be rendered as it is." - you probably mean their rendering will be left to the normal parser (ie. the extension will just pass them as wikitext).

yes

I disagree with @Tgr, I think line numbers are a useful way to specify a snippet. The wiki editor should use them with an unchanging commit (otherwise the file contents change), but it's her choice.

Your step 2.

Idea is to maintain a mapping/database of source (git-doc) and destination (wiki-page) and adding a post commit hook which triggers the update remotely from Jenkins.

is interesting. If the wikipage is pulling in HEAD, then depending on caching the page will remain the same even when git updates. I don't know how much caching the parser cache and page cache do. If it's only cached for a day or so and editors can force new contents with ?action=purge, then maybe we don't need this complicated dependency handling.

1 .

I think line numbers are a useful way to specify a snippet. The wiki editor should use them with an unchanging commit (otherwise the file contents change)

First case: One idea is that we don't update these snippets, since they come from a fixed commit (not from master), so no update mechanism is needed for such cases.
Second case: If we do need to update them, as that is the whole point of the project, then I am not sure how to find a simple and safe way to update such snippets. I tried to describe one in my previous project proposal:

The idea is to maintain a mapping/database of source (git-doc), destination (wiki page), commit-id (from which the content is taken), start line number (from where the content was pulled) and end line number (up to where the content was pulled). After any commit touching the corresponding file, compare the content of the git-doc between the start and end lines of the saved commit-id with the content of the git-doc between the start and end line patterns in the most recent commit.
(Why patterns? The start and end line numbers may have shifted in the most recent commit, so comparing content by line numbers alone might give false information. The solution is to search, in the most recent commit, for the patterns found at the git-doc's start and end lines in the saved commit-id.)
If a difference is found by this comparison, the previously saved commit-id, start line and end line are replaced by the current commit-id, start line and end line.

But it uses a database and a complex approach.
Which one is good to proceed with?
2 .

Your step 2.

Idea is to maintain a mapping/database of source (git-doc) and destination (wiki-page) and adding a post commit hook which triggers the update remotely from Jenkins.

is interesting. If the wikipage is pulling in HEAD, then depending on caching the page will remain the same even when git updates. I don't know how much caching the parser cache and page cache do. If it's only cached for a day or so and editors can force new contents with ?action=purge, then maybe we don't need this complicated dependency handling.

I believe to start this extension should just rely on the page cache and maybe the parser cache.

  • Is it acceptable to make editors use ?action=purge every time they want to see the updated snippets?
  • To automate this, we need to trigger the update (delete the cache, etc.) after each commit to the corresponding file, and to know which page's cache needs to be deleted we need a database/mapping. Comments?

If you can avoid having a database in the MVP version, you definitely should. It makes things a lot simpler. Maybe you find later you really need to add it, maybe other tasks prove more important. The wikitext parser has a cache, expiration of that cache is a dumb but very easy to use update mechanism; just set it to something short, like one day.
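A minimal sketch of this cheap expiry approach, assuming it is done inside the #GitAtWiki handler (the one-day TTL is just an example value):

// Inside the parser function handler: let the parser cache of any page using
// #GitAtWiki expire after a day, so transcluded git content refreshes on its own.
$parser->getOutput()->updateCacheExpiry( 86400 ); // seconds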

Thank you for your proposal. Sadly, the Outreachy administration team has made it a strict rule that candidates with any kind of academic or other commitments are not approved for this round. Please consider talking with your mentors about further steps, and we hope to see your application (the same one, if the consensus still holds) in the next round of GSoC/Outreachy. Closing this as declined, and all the very best for the next round!