Port MediaWikiAnalysis to SQLAlchemy
Closed, Declined · Public

Assigned To
None
Authored By
jgbarah
Oct 1 2015, 9:56 PM
Referenced Files
F2730722: mediawiki_analysis.py
Oct 17 2015, 12:52 PM
F2730729: database.py
Oct 17 2015, 12:52 PM
F2724356: mediawiki_analysis.py
Oct 15 2015, 6:28 PM
F2707973: database.py
Oct 11 2015, 9:34 AM
F2707974: mediawiki_analysis.py
Oct 11 2015, 9:34 AM
F2661511: db.py
Oct 6 2015, 8:56 AM

Description

SQLAlchemy is a Python ORM that could simplify the interaction of MediaWikiAnalysis with the database, and in fact make it database-type-independent.
The codebase can be found at https://github.com/MetricsGrimoire/MediaWikiAnalysis

This task consists of porting MediaWikiAnalysis to use high-level SQLAlchemy (at the level of the ORM) instead of the current MySQLdb package. For inspiration, the code of Sibyl, another MetricsGrimoire tool which already uses SQLAlchemy, can be browsed (look mainly at db.py).

$:user\> grep -r MySQLdb .
./mediawiki_analysis.py:import MySQLdb
./mediawiki_analysis.py:    con = MySQLdb.Connect(host="127.0.0.1",
./test/test_mediawiki.py:import MySQLdb, os, random, string, sys, unittest
./test/test_mediawiki.py:        MediaWikiTest.db = MySQLdb.connect(user=Config.db_user_out, passwd=Config.db_password_out)
./test/test_mediawiki.py:        MediaWikiTest.db = MySQLdb.connect(user=Config.db_user_out, passwd=Config.db_password_out, db=Config.db_database_out)
$:user\>
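For orientation, a minimal sketch (not part of the current code; the connection string, database name and credentials are placeholders) of what the MySQLdb connection above could become with an SQLAlchemy engine and session:

# Sketch only: SQLAlchemy equivalent of the MySQLdb.Connect() call above.
# The connection string ("mwdb", user, password) is a placeholder.
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine("mysql://user:passwd@127.0.0.1/mwdb?charset=utf8")
Session = sessionmaker(bind=engine)
session = Session()  # used for inserts and queries instead of a MySQLdb cursor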

Event Timeline

@jgbarah, since I am supposed to complete at least one microtask before the application deadline, I think this one can be done in that much time. Do you think I should start with this one?

This one needs some knowledge about Python, and will require you to learn about SQLAlchemy. Fortunately, there is plenty of documentation about SQLAlchemy, and the task needs little analysis: it amounts to doing the same as the tool does now, but using SQLAlchemy.

So, I would say that if you are interested in learning about the most popular Python ORM library, this is a good one to try. That said, you can start and see how it goes. If you run into too much trouble, maybe you can refocus on some of the other microtasks.

@jgbarah, I'll start with this one. Sounds engaging and I will get to learn plenty from it. So my task here would be to change the code in mediawiki_analysis.py to replace the usage of MySQLdb with SQLAlchemy?

As this is a microtask for an existing Possible-Tech-Projects task (T89135) and isn't something that can be proposed as a separate project, I am removing the Possible-Tech-Projects tag.

> @jgbarah, I'll start with this one. Sounds engaging and I will get to learn plenty from it. So my task here would be to change the code in mediawiki_analysis.py to replace the usage of MySQLdb with SQLAlchemy?

Yes. Have fun!

Hi. So, I have managed to get the hang of how this works. I went through a tutorial on SQLAlchemy and the Sibyl code. The creation of the database, and the commits to it, are in sibyl.py, and the relational mapping is in db.py. Should I follow the same approach, i.e. make one file for the mapping and another for the rest of the code, or simply make changes to mediawiki_analysis.py?

> Hi. So, I have managed to get the hang of how this works. I went through a tutorial on SQLAlchemy and the Sibyl code. The creation of the database, and the commits to it, are in sibyl.py, and the relational mapping is in db.py. Should I follow the same approach, i.e. make one file for the mapping and another for the rest of the code, or simply make changes to mediawiki_analysis.py?

I would do it "a-la-Sibyl", with two files, since that exposes better the db structure, and let other programs import it, if needed.

Hi, I have a doubt. The dbms we will be using in the backend is still MySQL, right? So, should I continue using MySQL datatypes? Sibyl seems to be using mysql.

Hi, @jgbarah, this is what the mapping file looks like. Let me know if this is fine.

Hi, @jgbarah, I am facing a problem in running the original mediawiki_analysis.py. I am writing this in the terminal: "./mediawiki_analysis.py --database mwdb --db-user root --url https://en.wikipedia.org/w" and getting "./mediawiki_analysis.py:141: Warning: Out of range value for column 'date' at row 1". What should I do about it? The values in the table seem to be getting updated, though.

> Hi, I have a doubt. The dbms we will be using in the backend is still MySQL, right? So, should I continue using MySQL datatypes? Sibyl seems to be using mysql.

Yes. For now, we're stuck with MySQL (or MariaDB, for that matter). In the context of this microtask, nothing more than that is needed.
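As an aside, if MySQL-specific column types are wanted (the Sibyl approach mentioned above), the mapping could use the types from the MySQL dialect; the class below is only a made-up example:

# Sketch: MySQL dialect column types (table and column names are illustrative)
from sqlalchemy import Column
from sqlalchemy.dialects.mysql import INTEGER, VARCHAR, DATETIME

class WikiPageRev(Base):                  # Base as in a declarative mapping module
    __tablename__ = 'wiki_pages_revs'
    rev_id = Column(INTEGER(unsigned=True), primary_key=True)
    user = Column(VARCHAR(255))
    date = Column(DATETIME)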

> Hi, @jgbarah, this is what the mapping file looks like. Let me know if this is fine.

At first glance, it seems reasonable to me. Let's see if it works ;-)

> Hi, @jgbarah, I am facing a problem in running the original mediawiki_analysis.py. I am writing this in the terminal: "./mediawiki_analysis.py --database mwdb --db-user root --url https://en.wikipedia.org/w" and getting "./mediawiki_analysis.py:141: Warning: Out of range value for column 'date' at row 1". What should I do about it? The values in the table seem to be getting updated, though.

Hmmm. That's a bit weird. Let's forget about that for now, if the tables are being filled smoothly. But I will have a look at it.

Hi, @jgbarah, I made the changes in the files.

I am getting this error "UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)" when I run "./mediawiki_analysis.py --database mdb --db-user root --url https://www.wikipedia.org/w". I am trying to figure out why. It has something to do with the utf-8 encoding.

There seems to be a problem with the print statement in the insert_page function, line 101. It seems to work if I use only pageid in the print statement.

Hi, @Anmolkalia,

I've run your code, and I don't see that unicode error you see:

$ python mediawiki_analysis.py --database mdb --db-user jgb --db-password XXX --url https://www.wikipedia.org/w
None
Namespaces query: https://www.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces&format=xml

And it stays there running solidly, while I can see the Wiki_pages table growing.

In any case, the trouble could be with unicode characters in Wikipedia pages or something. Could you try to reproduce the error in a concrete case? (or at least let me know for how long I have to run the tool to see the error).

Just in case, some recommendations to try to fix the trouble yourself:

  • For setting the utf8 codec on stdout (so that printing unicode content works):
# imports needed by this snippet
import sys
from codecs import getwriter

def stdout_utf8 ():
    """Set utf8 codec for stdout.

    This is in fact a trick to make the script work with unicode content
    when using pipes (pipes confuse the interpreter, which sets codec to None)
    More info:
    http://stackoverflow.com/questions/492483/setting-the-correct-encoding-when-piping-stdout-in-python

    """

    sys.stdout = getwriter('utf8')(sys.stdout)

  • For creating the engine to connect to MySQL, avoiding unicode trouble:
# To set Unicode interaction with MySQL
# http://docs.sqlalchemy.org/en/rel_0_9/dialects/mysql.html#unicode
trailer = "?charset=utf8&use_unicode=0"
database = self.url + trailer
engine = create_engine(database,
                       convert_unicode=True, encoding='utf8',
                       echo=echo)

Hope this helps. Please, let me know if you need further help.
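For reference, a minimal sketch of how the two recommendations above could be wired together at startup; the connection string and variable names are placeholders, not the actual MediaWikiAnalysis code:

# Sketch: combining both recommendations (placeholder names)
stdout_utf8()                                  # make print safe for utf8 output
url = "mysql://user:passwd@127.0.0.1/mwdb"     # placeholder connection string
trailer = "?charset=utf8&use_unicode=0"
engine = create_engine(url + trailer,
                       convert_unicode=True, encoding='utf8')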

In fact, after a while, I get an error with your code:

$ python mediawiki_analysis.py --database mdb --db-user jgb --db-password XXX --url https://www.wikipedia.org/w
None
Namespaces query: https://www.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces&format=xml
Traceback (most recent call last):
  File "mediawiki_analysis.py", line 290, in <module>
    else: process_all(cursor)
  File "mediawiki_analysis.py", line 260, in process_all
    process_all_namespace(cursor, ns)
  File "mediawiki_analysis.py", line 235, in process_all_namespace
    con.commit()
NameError: global name 'con' is not defined

Which seems reasonable to me, because in that line you refer to "con", but it seems there is no "con" defined at that point. Maybe you should refer to "cursor" instead?

Hi, @jgbarah, I am getting that error if I run the code again, basically when the database already contains the values that are being retrieved. Meaning, the print statement in the except part of the insert_page function is giving this error. I'll go through what you sent and get back to you. Thank you for the help.

> There seems to be a problem with the print statement in the insert_page function, line 101. It seems to work if I use only pageid in the print statement.

What do you mean here exactly? In your code I read:

print (pageid+" "+ namespace + " " + title+" was already in the db")

Do you mean that

print (" "+ namespace + " " + title+" was already in the db")

doesn't work or what?

> In fact, after a while, I get an error with your code:
>
> $ python mediawiki_analysis.py --database mdb --db-user jgb --db-password XXX --url https://www.wikipedia.org/w
> None
> Namespaces query: https://www.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces&format=xml
> Traceback (most recent call last):
>   File "mediawiki_analysis.py", line 290, in <module>
>     else: process_all(cursor)
>   File "mediawiki_analysis.py", line 260, in process_all
>     process_all_namespace(cursor, ns)
>   File "mediawiki_analysis.py", line 235, in process_all_namespace
>     con.commit()
> NameError: global name 'con' is not defined
>
> Which seems reasonable to me, because in that line you refer to "con", but it seems there is no "con" defined at that point. Maybe you should refer to "cursor" instead?

Oh, yeah, I realised that mistake a bit later and updated it. Here is the new file.

> There seems to be a problem with the print statement in the insert_page function, line 101. It seems to work if I use only pageid in the print statement.
>
> What do you mean here exactly? In your code I read:
>
> print (pageid+" "+ namespace + " " + title+" was already in the db")
>
> Do you mean that
>
> print (" "+ namespace + " " + title+" was already in the db")
>
> doesn't work or what?

Yes, if I remove print (pageid+" "+ namespace + " " + title+" was already in the db") and replace it with print (pageid+" was already in the db"), I don't get this problem. The problem is with the encoding of the title, I suppose.

BTW, to ease the process of reviewing the code, maybe you can fork the MediaWikiAnalysis GitHub repository and point me to your fork, and I'll just clone it. That way I can follow your changes more easily.

Thanks!

> Yes, if I remove print (pageid+" "+ namespace + " " + title+" was already in the db") and replace it with print (pageid+" was already in the db"), I don't get this problem. The problem is with the encoding of the title, I suppose.

Is it an issue with unicode? If so, try the trick above, stdout_utf8(), to set up stdout to "understand" utf8...

> Hi, @jgbarah, I am getting that error if I run the code again, basically when the database already contains the values that are being retrieved. Meaning, the print statement in the except part of the insert_page function is giving this error. I'll go through what you sent and get back to you. Thank you for the help.

Try to ensure that you're reading and storing unicode properly in the database (second part of the code in my comment above) and that stdout has no issues with utf8 (first part of the code). I'm pretty sure that's the issue...

If you cannot fix that, you can also skip printing the title, and come back to that later. But maybe it is better that you take the opportunity to master unicode handling in Python 2.x, which is a bit weird...
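One possible way to handle the title in that print (a sketch for Python 2, assuming the API returns utf8-encoded byte strings; not necessarily what the final code should do):

# Sketch (Python 2): decode the utf8 byte string before building the message
if isinstance(title, str):
    title = title.decode('utf8', 'replace')
print (pageid + " " + namespace + " " + title + " was already in the db")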

Hi @jgbarah. I compared the databases obtained with the original code and the one I updated. The values getting updated in the databases are the same. The difference is that when I use "SELECT * FROM mwdb.wiki_pages WHERE title NOT IN (SELECT title FROM mdb.Wiki_pages);", I get "ERROR 1267 (HY000): Illegal mix of collations (utf8_general_ci,IMPLICIT) and (utf8_unicode_ci,IMPLICIT) for operation '='". So there is some problem with the collation of the title column. I will try to find a workaround.

Okay, I was able to fix that. The databases are exactly the same. So, there is no problem with storage of the data.

Alright, so that encoding problem is also there in the old code. I am trying to figure out why.

I found a not-so-elegant solution to it, so the code is working fine now. I will try to come up with a better solution to the problem.

So, I was able to correct the code. This is working fine now.

I have solved all the bugs I came across.
I hope this works fine. Do let me know if something else is required.

> So, I was able to correct the code. This is working fine now.
>
> I have solved all the bugs I came across.
> I hope this works fine. Do let me know if something else is required.

Thanks!

I'll check asap. Meanwhile, could you set up that git repo, forked from the MediaWikiAnalysis repo, with your modifications? That would allow us to deal with this as a pull request too, later on.

I'll get that done by tomorrow.

> So, I was able to correct the code. This is working fine now.
>
> I have solved all the bugs I came across.
>
> I'll check asap.

@jgbarah: Has that happened in the meantime?

@Anmolkalia, are you still doing this? Can I continue it, or take over if you've stopped?

> @Anmolkalia, are you still doing this? Can I continue it, or take over if you've stopped?

Hi. Actually, I had made the requisite changes and updated the GitHub repository accordingly. If there are further changes required, I would be keen to make them. But help is welcome :)

Aklapper updated the task description.

@Anmolkalia: I am resetting the assignee of this task because there have not been signs of progress lately (please correct me if I'm wrong).
Resetting the assignee avoids the impression that somebody is already working on fixing this task and it also allows anybody else to potentially work towards fixing this task. Thanks for your understanding! :)

If somebody wants to work on this importer tool, please take a look at the Perceval backend, which is more robust and is what we are using in production:

https://github.com/grimoirelab/perceval/blob/master/perceval/backends/core/mediawiki.py

Declining this task as MediaWikiAnalysis has been superseded by the mediawiki.py script in the Perceval backend.