Using Parsoid as a wikitext bridge for importing content into wikitext format
Open, Low, Public

Description

Besides converting wikitext to HTML, Parsoid can also convert HTML to wikitext. This opens up the possibility of taking HTML produced by different applications and converting it to wikitext.

Examples:

  1. converting Phabricator output to wikitext
  2. converting Markdown to wikitext
  3. converting Google Docs output to wikitext
  4. converting Word document output to wikitext

For #2, there is https://gerrit.wikimedia.org/r/#/c/225253/.
For #3, a quick prototype is at http://gwicke.github.io/paste2wiki/.
For #4, the CKEditor paste-from-Word plugin might be worth looking at, to borrow its tricks for normalizing Word's HTML before conversion to wikitext.

Strictly speaking, Parsoid is not required, but it could provide a simpler, unified interface for this.

So, the goals of this project would be to:
(a) develop a web service that provides these conversion utilities in a single place;
(b) have it talk to Parsoid under the hood to do the necessary conversions;
(c) do any necessary HTML cleanup / normalization to nudge Parsoid into producing clean wikitext;
(d) tweak Parsoid's HTML-to-wikitext code to better enable these transformations.

An example of (d) is T127207

Ideally, this code will be part of an npm package that can then be used in Parsoid (and anywhere else that might benefit from it).
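
As a rough illustration of goal (c), the cleanup step could look something like the sketch below. It is only a sketch under stated assumptions: normalizePastedHtml is a hypothetical helper name, not part of Parsoid, and it uses regular expressions for brevity where a real implementation would more likely walk a parsed DOM.

```javascript
// Hypothetical helper (not a Parsoid API): strips markup that word processors
// typically add, so that the HTML handed to Parsoid is closer to plain
// semantic HTML. Regex-based for brevity; nested spans are not handled.
function normalizePastedHtml(html) {
    return html
        // drop HTML comments, including Word's conditional comments
        .replace(/<!--[\s\S]*?-->/g, '')
        // drop <style> and <script> elements together with their content
        .replace(/<(style|script)\b[\s\S]*?<\/\1\s*>/gi, '')
        // drop <meta> and <link> tags
        .replace(/<(meta|link)\b[^>]*>/gi, '')
        // drop namespaced Office tags such as <o:p>, keeping any text inside
        .replace(/<\/?[a-z]+:[a-z0-9]+\b[^>]*>/gi, '')
        // strip presentational attributes
        .replace(/\s(?:style|class|lang)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)/gi, '')
        // unwrap spans that no longer carry any attributes
        .replace(/<span>([\s\S]*?)<\/span>/gi, '$1')
        // collapse runs of whitespace
        .replace(/[ \t\r\n]+/g, ' ')
        .trim();
}

var pasted = '<p class="MsoNormal" style="margin:0">' +
    '<span style="color:red">Hello</span> <o:p></o:p><b>world</b></p><!-- c -->';
console.log(normalizePastedHtml(pasted));
```

The output of a step like this would then be sent to Parsoid for the actual HTML-to-wikitext conversion.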

Details:

  • Primary mentor: <>
  • Co-mentor: <>
  • Other mentors: (optional, Phabricator username) <>
  • Skills: node.js, some familiarity with wikitext
  • Estimated project time for a senior contributor: 2-4 weeks
  • Microtasks: T129562

Event Timeline

ssastry renamed this task from to Using Parsoid as a wikitext bridge for importing content into wikitext format.Feb 18 2016, 4:34 PM
01tonythomas added a subscriber: 01tonythomas.

@ssastry: Thank you for adding the task. I am adding the Possible-Tech-Projects tag so that prospective students will get to know about it. I will be placing this in the Missing-Mentors list; if you are ready to mentor, please add that to the task description.

ssastry updated the task description.
ssastry added a subscriber: Arlolra.

@01tonythomas I added mentor information to the description of the task. FYI.

01tonythomas updated the task description.

@ssastry Thanks for your input. As per https://www.mediawiki.org/wiki/Outreach_programs/Life_of_a_successful_project#Coming_up_with_a_proposal, we need one more mentor listed as Co-mentor to get this project featured for this round of GSoC/Outreachy.

@Arlolra, I'm keen to take up this project for GSoC 2016. I'd appreciate a short discussion on this idea, as I have plenty of ideas for it. At present I'm working on a small prototype in node.js that serves the same purpose on a smaller scale.
Thank you,
yours sincerely,
vansh khanna

@ssastry, I'm keen to take up this project for GSoC 2016. I'd appreciate a short discussion on this idea, as I have plenty of ideas for it. At present I'm working on a small prototype in node.js that serves the same purpose on a smaller scale.
Thank you,
yours sincerely,
vansh khanna

@Khannaanant262129 We're on IRC in #mediawiki-parsoid or you can leave your thoughts here.

Also, see https://www.mediawiki.org/wiki/Parsoid#Contacting_us

@Arlolra, is there a specific time when I can reach you on IRC? I'm in India, so we are in different time zones.
Your reply is deeply appreciated.
yours sincerely,
vansh khanna.

I'm usually around during Pacific workday hours, though not this week. @ssastry is in Central.

But maybe email (see the linked contact page) or here on Phabricator is more convenient. Up to you.

@Arlolra My present prototype fetches text files and renders their HTML equivalent.
I am keen to show you my prototype so far; in the long run, this is what has to be implemented as a web service in node.js.

Sure, just post a link and we'll take a look.

@Arlolra, the code is on my local machine at present, and within the next 24 hours I will deploy it on my domain (vanshkhanna.com) so that you can see a working prototype live.
P.S. I have some issues with my hosting plan.
Alternatively, I can mail you the code or post it here, but I don't see that as a professional way of doing it. Is that OK?
Your reply is deeply appreciated,
yours sincerely,
vansh khanna.

@Khannaanant262129 - I think there are a lot of different methods you can use for delivering code. Github is the most mainstream option, but there's a pretty big selection of online code hosting alternatives, including many free (libre) ones. I think using a code hosting service to share your work would be easier for us than getting your work in an email tarball, and learning how to publish to a code hosting service would be a good step in your professional growth.

@Arlolra I am attaching my code and the result. This is a working model of what has to be implemented in this project in the long run. It reads an HTML file, converts <a>, <i>, <br> and <h1> tags as per wiki markup (https://en.wikipedia.org/wiki/Help:Wiki_markup) and saves the output in the file wikitext.html. The screenshot below shows the HTML text and the equivalent wikitext.

your reply is deeply appreciated,
yours sincerely,
vansh khanna.

var http = require('http');
var fs = require('fs');

http.createServer(onRequest).listen(8888);
console.log("server is running....");

// read the input once at startup and write the converted copy to disk
var content = fs.readFileSync('wiki.html').toString();
replacetags(content);

function replacetags(text) {
    console.log(text);

    // String.replace with a string pattern only replaces the first match,
    // so global regular expressions are used to convert every tag

    // link tags: '<a>' ... '</a>' to [[ ... ]]
    var linkOpen = text.replace(/<a>/g, "[[");
    var linkClose = linkOpen.replace(/<\/a>/g, "]]");

    // heading tags: '<h1>' ... '</h1>' to = ... =
    var hOpen = linkClose.replace(/<h1>/g, "=");
    var hClose = hOpen.replace(/<\/h1>/g, "=");

    // line breaks
    var brTagsRemoved = hClose.replace(/<br>/g, "{{break}}");

    // italics: '<i>' ... '</i>' to '' ... ''
    var removeIOpen = brTagsRemoved.replace(/<i>/g, "''");
    var final = removeIOpen.replace(/<\/i>/g, "''");

    console.log(final);

    fs.writeFileSync('wikitext.html', final);
}

// 404 response
function send404Response(response) {
    response.writeHead(404, {"Content-Type" : "text/plain"});
    response.write("Error404: Page not found!");
    response.end();
}

// Handle request
function onRequest(request, response) {
    if (request.method == 'GET' && request.url == '/') {
        response.writeHead(200, {"Content-Type" : "text/plain"});
        // pipe the page through the response object
        fs.createReadStream("./wikitext.html").pipe(response);
    } else {
        send404Response(response);
    }
}

wiki-parsoid.png (548×1 px, 44 KB)

@RobLa-WMF, your suggestion is much appreciated, sir. I'll certainly host the code on my Git account.

@Khannaanant262129 - thanks for showing us the code! One smoother way to share it in the future is in the Paste feature of Phabricator. To find it:

  1. Go to our Phab home page: https://phabricator.wikimedia.org/
  2. Choose "Applications" from the left menu.
  3. In the "Utilities" section in the list, find "Paste" (direct link: https://phabricator.wikimedia.org/paste/)
  4. In the upper right corner, select the "Create Paste" button
  5. Fill in the form, including your code. This is what you should see:

Screenshot 2016-03-05 at 4.16.27 PM.png (778×611 px, 56 KB)

I started doing this on your behalf, then realized that you should probably be the one to do it, so I didn't select "Create new paste". However, if I had, your code would appear in a similar way to this example: P2540, and you can put the Paste number inside of curly braces to include it inline in a comment (like "{Pxxxx}"). I hope this helps!

@RobLa-WMF, your help is much appreciated, sir. @RobLa-WMF @Arlolra, I have added a paste:

var http = require('http');
var fs = require('fs');

http.createServer(onRequest).listen(8888);
console.log("server is running....");

// read the input once at startup and write the converted copy to disk
var content = fs.readFileSync('wiki.html').toString();
replacetags(content);

function replacetags(text) {
    console.log(text);

    // String.replace with a string pattern only replaces the first match,
    // so global regular expressions are used to convert every tag

    // link tags: '<a>' ... '</a>' to [[ ... ]]
    var linkOpen = text.replace(/<a>/g, "[[");
    var linkClose = linkOpen.replace(/<\/a>/g, "]]");

    // heading tags: '<h1>' ... '</h1>' to = ... =
    var hOpen = linkClose.replace(/<h1>/g, "=");
    var hClose = hOpen.replace(/<\/h1>/g, "=");

    // line breaks
    var brTagsRemoved = hClose.replace(/<br>/g, "{{break}}");

    // italics: '<i>' ... '</i>' to '' ... ''
    var removeIOpen = brTagsRemoved.replace(/<i>/g, "''");
    var final = removeIOpen.replace(/<\/i>/g, "''");

    console.log(final);

    fs.writeFileSync('wikitext.html', final);
}

// 404 response
function send404Response(response) {
    response.writeHead(404, {"Content-Type" : "text/plain"});
    response.write("Error404: Page not found!");
    response.end();
}

// Handle request
function onRequest(request, response) {
    if (request.method == 'GET' && request.url == '/') {
        response.writeHead(200, {"Content-Type" : "text/plain"});
        // pipe the page through the response object
        fs.createReadStream("./wikitext.html").pipe(response);
    } else {
        send404Response(response);
    }
}

Your reviews and suggestions are awaited. I'd also like to discuss a couple of good ideas that I have regarding the project.

@Khannaanant262129 I think you should read through the description above again and familiarize yourself with Parsoid. Parsoid already has the ability to serialize from HTML to wikitext (which is what your replacetags is trying to accomplish). What we'd like is a way for users to input other forms of markup, which then get converted to wikitext. The pipeline would be markup > html > wikitext.

You're sort of on the right track. A first step might be an http server that serves a form where users can input some marked up text (any flavour of markup will do) and then when posted, gets converted to html. Maybe include a dropdown menu where the user can choose the type of markup they're sending and the service recognizes.
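
To make the "markup > html" half of that pipeline concrete, here is a minimal sketch of how a service might dispatch on the markup type chosen in the dropdown. Everything in it is an assumption made for illustration: the names converters, toHtml and escapeHtml are hypothetical, and the "markdown" converter is a toy stand-in that only understands headings and paragraphs (a real service would delegate to an actual Markdown library).

```javascript
// Escape text so it can be placed safely inside HTML element content.
function escapeHtml(text) {
    return text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

// Hypothetical registry: one function per markup type the service recognizes.
// Each function takes source text and returns HTML.
var converters = {
    // HTML input needs no conversion before it is handed to Parsoid
    html: function (source) { return source; },
    // Toy stand-in for a Markdown library: headings and paragraphs only
    markdown: function (source) {
        return source.split(/\n{2,}/)
            .map(function (block) { return block.trim(); })
            .filter(function (block) { return block.length > 0; })
            .map(function (block) {
                var heading = /^(#{1,6})\s+(.*)$/.exec(block);
                if (heading) {
                    var level = heading[1].length;
                    return '<h' + level + '>' + escapeHtml(heading[2]) + '</h' + level + '>';
                }
                return '<p>' + escapeHtml(block) + '</p>';
            })
            .join('\n');
    }
};

// Convert source text of the given markup type to HTML, rejecting unknown types.
function toHtml(markupType, source) {
    if (!Object.prototype.hasOwnProperty.call(converters, markupType)) {
        throw new Error('Unsupported markup type: ' + markupType);
    }
    return converters[markupType](source);
}

console.log(toHtml('markdown', '# Title\n\nSome <text>'));
```

The HTML returned by toHtml would then go to Parsoid for the "html > wikitext" half; supporting another flavour of markup would only mean adding one more entry to the registry.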

@Arlolra, actually that was step two of the current prototype, i.e. accepting a markup file and its type from the user. I'll store its path in the database and the file itself in the file system (this keeps the database light and efficient). The file is then fetched along with another file that contains the rules for the submitted markup type. The user-submitted file is then converted into HTML, and for the rest we have Parsoid.

The javascript code looks like this:

/**
 * Created by db2admin on 07-03-2016.
 */
var express = require('express');
var fs = require('fs');
var path = require('path');
var fileUpload = require('express-fileupload');
var mysql = require('mysql');

var app = express();
app.use(fileUpload());

var connection = mysql.createConnection({
    host : 'localhost',
    user : 'vansh_khanna',
    password : '< MySQL password >',
    database : 'markup_tags'
});

connection.connect();

app.set('views', __dirname + '/views');
// set the view engine to ejs
app.set('view engine', 'ejs');

// index page: use res.render to load up an ejs view file
app.get('/', function(req, res) {
    res.render('upload', {title : 'wikitext'});
});

app.post('/uploads', function(req, res) {
    if (!req.files || !req.files.uploadFile) {
        res.send('No files were uploaded.');
        return;
    }

    var markupFile = req.files.uploadFile;
    // store the upload in an existing "uploads" directory next to this script;
    // basename() keeps a crafted file name from escaping that directory
    var target = path.join(__dirname, 'uploads', path.basename(markupFile.name));

    markupFile.mv(target, function(err) {
        if (err) {
            res.status(500).send(err);
            return;
        }
        // read the file only after the move has completed
        var content = fs.readFileSync(target).toString();
        toHTML(content);
        res.send('File uploaded!');
    });
});

function toHTML(text) {
    console.log(text);

    // ## here we will fetch the markup syntax from the database depending on the extension of the uploaded file
    // ## the HTML file so made will be converted into Wiki-Markup document
}

app.listen(3000, function () {
    console.log('server up.....');
});

The corresponding HTML page looks like this, just as you suggested:

sample-wiki.png (768×1 px, 112 KB)

your suggestions and reviews are deeply appreciated,
vansh khanna.
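
The comment inside the toHTML stub above mentions choosing the markup rules from the extension of the uploaded file. A minimal sketch of that lookup, using an in-memory table as a lightweight stand-in for the database, might look as follows. The names EXTENSION_TO_MARKUP and markupTypeForFile, and the extensions listed, are assumptions made for illustration only.

```javascript
// Hypothetical table mapping a file extension to the markup type whose
// conversion rules should be applied. A database lookup could replace it.
var EXTENSION_TO_MARKUP = {
    md: 'markdown',
    markdown: 'markdown',
    html: 'html',
    htm: 'html'
};

// Return the markup type for an uploaded file name, or null if unrecognized.
function markupTypeForFile(filename) {
    var dot = filename.lastIndexOf('.');
    // no dot, or only a leading dot (e.g. ".bashrc"), means no usable extension
    if (dot <= 0) {
        return null;
    }
    var ext = filename.slice(dot + 1).toLowerCase();
    return Object.prototype.hasOwnProperty.call(EXTENSION_TO_MARKUP, ext)
        ? EXTENSION_TO_MARKUP[ext]
        : null;
}

console.log(markupTypeForFile('notes.MD'));
```

Returning null for unknown extensions lets the upload handler reject the file (or fall back to the type chosen by the user) instead of guessing.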

@Khannaanant262129 I recommend that you show up on IRC (#mediawiki-parsoid) and talk through this. @Arlolra should be around this week. @cscott and I will be around as well, but I will be less available over the next 2 weeks (travelling and otherwise busy). Arlo mentioned his time zone above (Pacific Time).

Here is a small client-side prototype tool for HTML-to-wikitext: http://gwicke.github.io/paste2wiki/

It cleans up the pasted HTML a bit & then calls the Parsoid / RESTBase html2wt API end point to convert to wikitext. The cleaned-up HTML is available for copy&pasting with VisualEditor as well. Maybe this proves useful as an inspiration.
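
For anyone who wants to try that conversion step from their own code, the following is a rough sketch of such a call. It is not the paste2wiki source. It assumes the public RESTBase transform route /transform/html/to/wikitext on en.wikipedia.org, with the HTML sent as a form field named html; REST_BASE, buildHtml2wtRequest and htmlToWikitext are hypothetical names, and the base URL should be adjusted for the wiki being targeted.

```javascript
// Assumption: base URL of a public RESTBase instance; adjust for your wiki.
var REST_BASE = 'https://en.wikipedia.org/api/rest_v1';

// Build the URL and fetch() options for an HTML-to-wikitext request.
function buildHtml2wtRequest(html, restBase) {
    return {
        url: (restBase || REST_BASE) + '/transform/html/to/wikitext',
        options: {
            method: 'POST',
            headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
            // the HTML to convert travels in the "html" form field
            body: new URLSearchParams({ html: html }).toString()
        }
    };
}

// Send the request and resolve with the wikitext returned by the service.
// fetchImpl defaults to the global fetch (Node 18+ or a browser).
function htmlToWikitext(html, fetchImpl) {
    var request = buildHtml2wtRequest(html);
    return (fetchImpl || fetch)(request.url, request.options).then(function (response) {
        if (!response.ok) {
            throw new Error('html2wt request failed with status ' + response.status);
        }
        return response.text();
    });
}
```

A call such as htmlToWikitext('<p><i>Hello</i></p>').then(console.log) would print whatever wikitext the service returns. Keeping the request construction in its own function makes it possible to check the request without touching the network.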

Arlolra triaged this task as Medium priority.Apr 12 2016, 11:24 PM

@Arlolra , @ssastry this was a featured task for GSoC 2016 and still is in good shape. Are you ready to mentor this one for the upcoming Outreachy-13 ( Dec 6-March 6) ? Please let us know so that we can feature this :)

Hi,

I am a final-year Computer Science and Engineering undergraduate at the University of Moratuwa.
I am familiar with node.js and am currently gaining familiarity with wikitext and Parsoid. I am considering this project for Outreachy Round 13. As I understand it, this project is about developing a web service that converts Phabricator output, Markdown, Google Docs output and Word document output to wikitext, using Parsoid to do so. Please correct me if I am mistaken.
While I learn about wikitext and Parsoid, please suggest what else I could do to proceed. Are there any related microtasks that I could work on? I am quite new to Wikimedia, but am a quick learner. Any help would be highly appreciated.
Thank you.

The understanding is correct.

I suggest you go through https://www.mediawiki.org/wiki/How_to_become_a_MediaWiki_hacker, set up your environment and Gerrit, and try your hand at some easy bugs, preferably Parsoid ones if you want to understand it better.
As for this task, we'll soon have a relevant microtask if the project is ready to go.

Thank you! I shall attempt that and get back to you if I come across any issues.

Hi again,

I have set up Gerrit and have gone through the details on Parsoid. I went through the bugs for Parsoid. Can you please suggest some easy task(bug) that I could try out? Also kindly let me know if this project will be mentored for Outreachy or if I should be trying on another project.
Thank you.

Hi riyafa, we're not running this project in the current round of Outreachy. You're advised to look at the other projects on the Outreachy-13 board.

ssastry lowered the priority of this task from Medium to Low.Nov 7 2017, 12:09 AM
srishakatux added a subscriber: srishakatux.

Adding Outreach-Programs-Projects and removing Possible-Tech-Projects as we are planning on killing that workboard soon!

I am interested in taking up this project for GSoC 2020. I have worked with Node.js extensively (both using and creating APIs) and have published a package to npm in the past. I have not worked with PHP, but I am willing to pick it up for this project. I am also interested in working with parsers and the like, to complement a course that I am currently taking. I would love to see this project through to completion!

Edit: I apologize for 'waking up' this thread, I was just wondering if this task is still active and how I could take it up :)