Aug 26 2016
Aug 15 2016
Aug 14 2016
There are now 5 different types of questions:
- Neutral Point of View (NPOV) - Backlog category with section tags - code
- Article needing updating - Backlog category with section and inline tags - code
- Articles containing the word 'recently' - code
- Articles needing copy-edit - Ranked on the basis of Flesch-Kincaid readability score and pageviews - code
- Student edits - ranked on the basis of edit size and pageviews - code
All question creation scripts have the functionality to POST to ask. They have been commented for now. If someone else sets up an instance on their own private server, we can help them out by loading questions into it with POST to /ask.
Commit is here.
I have used the Special:Students page for obtaining student contributor usernames who have made recent edits. I then obtain their most recent edits information such as pageid, revid, size of edit, etc using the Usercontrib API. The final rankings are based on a high value of pageviews and edit size. So bigger edits on frequently visited articles are higher on the list for reviewing.
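A minimal sketch of the ranking step, not the actual script. It assumes each edit is a dict with the hypothetical keys 'title', 'revid', 'size' and 'pageviews', and it assumes the two signals are combined by adding their standardized values; the real weighting may differ.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize values to zero mean and unit variance (0.0 if constant)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def rank_student_edits(edits):
    """Sort edits so that big edits on frequently viewed pages come first.

    `edits` is a non-empty list of dicts with the hypothetical keys
    'title', 'revid', 'size' and 'pageviews'.
    """
    size_z = zscores([abs(e["size"]) for e in edits])
    views_z = zscores([e["pageviews"] for e in edits])
    # Assumption: both signals get equal weight.
    scored = [(s + v, e) for s, v, e in zip(size_z, views_z, edits)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [e for _, e in scored]
```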
Code for this script is here.
All scripts now use pageviews data for ranking purposes.
The associated python script that every question-creation script uses is here.
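For readers without access to the link, a shared pageview helper could look roughly like this. It is a sketch that assumes the public Wikimedia pageviews REST endpoint and a 30-day window; the function names are hypothetical and the linked script may work differently. Fetching the URL is left out so the example has no network dependency.

```python
from datetime import date, timedelta
from urllib.parse import quote

# Assumption: the per-article endpoint of the Wikimedia pageviews REST API.
API_ROOT = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageviews_url(title, end, days=30, project="en.wikipedia"):
    """Build the request URL for daily views of `title` over the last `days` days."""
    start = end - timedelta(days=days)
    article = quote(title.replace(" ", "_"), safe="")
    return "{}/{}/all-access/user/{}/daily/{}/{}".format(
        API_ROOT, project, article,
        start.strftime("%Y%m%d"), end.strftime("%Y%m%d"))


def total_views(response):
    """Sum the 'views' field over the 'items' list of a parsed JSON response."""
    return sum(item["views"] for item in response.get("items", []))
```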
There is a new directory /logs for logging.
There are two types of log files:
- activity-yyyymmdd -- logs info such as token, ip, qnum, qtype, request method after every GET and POST request in /ask, /answer, /recommend
- results-yyyymmdd -- logs qnum, all associated qtypes with modtimes after POST on /recommend
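As an illustration of the activity log, here is a minimal sketch that appends one tab-separated line per request to a per-day file. The function name, field order and separator are assumptions, not the committed code.

```python
import os
from datetime import datetime, timezone


def log_activity(log_dir, token, ip, qnum, qtype, method, endpoint, now=None):
    """Append one request record to logs/activity-yyyymmdd and return the path."""
    now = now or datetime.now(timezone.utc)
    os.makedirs(log_dir, exist_ok=True)
    path = os.path.join(log_dir, "activity-" + now.strftime("%Y%m%d"))
    # Assumption: tab-separated fields, one request per line.
    fields = [now.isoformat(), endpoint, method, token, ip, qnum, qtype]
    with open(path, "a", encoding="utf-8") as fh:
        fh.write("\t".join(str(f) for f in fields) + "\n")
    return path
```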
commit is here
Aug 8 2016
Aug 7 2016
This was a big task. Apart from the deliverables of this task, I have also added username and email validation. Sub-tasks completed are the following:
- Wrote the register() function and register.html templates. Registered reviewers are sent a salted SHA-512 hash of their email address, which they can use as a token while answering questions
- New directory called 'registered' with one file per user, named after the user's token. Each file contains the user's info and the files they worked on
- Set the token in the session key
- Appended tokens to files when token is set in session key
- Username and email validation, including a check for clashes with existing registered users
- Checking and setting tokens in new end-point /token
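The validation and clash check can be sketched as follows. The patterns and the error messages are illustrative assumptions; the actual rules in register() are not shown here.

```python
import re

# Assumption: deliberately simple patterns, not the ones used in register().
USERNAME_RE = re.compile(r"^[A-Za-z0-9_.-]{3,30}$")
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_registration(username, email, existing):
    """Return a list of error strings; an empty list means the input is valid.

    `existing` is an iterable of (username, email) pairs already registered.
    """
    existing = list(existing)
    errors = []
    if not USERNAME_RE.match(username):
        errors.append("invalid username")
    if not EMAIL_RE.match(email):
        errors.append("invalid email")
    # Clash checks are case-insensitive.
    if username.lower() in {u.lower() for u, _ in existing}:
        errors.append("username already registered")
    if email.lower() in {e.lower() for _, e in existing}:
        errors.append("email already registered")
    return errors
```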
Aug 3 2016
Commit is here.
commit is here
@Jsalsman did you have time to look into this?
@Jsalsman is it really necessary to have separate pages for 'review' and 'tiebreak'? I have added the necessary information in the page for 'answer' itself. Too many pages with the same info might be redundant and confusing.
Aug 2 2016
I was reading up more on browser caching and it turns out there's more to it than meets the eye. After pulling the updated code to ToolLabs, I realized that autocomplete=off doesn't fix the problem on ToolLabs (it only worked locally) and it might not work on all browsers.
@Jsalsman What's answer-i and answer-r? As far as I know, we have answer-e and answer-o for endorse and oppose respectively.
Aug 1 2016
Pressing the browser's back button reloads the page, and because the nextrecord() function returns and displays a new question each time, the form shows a new question while still holding the answer that was entered for the previous one. To clear that stale answer, we have to put an autocomplete="off" attribute in the form tag. I'm verifying whether this works under all circumstances and in all browsers.
Use a secure hash of their email and a salt, i.e. Base64(SHA-512("some 'salt' string" + emailaddr)), and store it in the session cookie like the "flash" messages which show feedback. Use it to populate a new field in the /answer form so that once they enter it, it stays until they clear or change it.
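The token formula above can be sketched with the standard library. The function name is hypothetical, and the URL-safe Base64 alphabet is an assumption made here so the token can double as a filename (plain Base64 may contain '/').

```python
import base64
import hashlib


def make_token(email, salt):
    """Return Base64(SHA-512(salt + email)) as an ASCII string."""
    digest = hashlib.sha512((salt + email).encode("utf-8")).digest()
    # Assumption: URL-safe alphabet, so the token is safe in paths and URLs.
    return base64.urlsafe_b64encode(digest).decode("ascii")
```

The same email and salt always give the same token, so a reviewer who loses it can be sent it again without storing anything beyond the salt.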
No, it didn't. It was the same UnicodeDecodeError, which went away after setting the system to default to using utf-8.
Jul 30 2016
@Jsalsman , I have now fixed the error.
@Jsalsman , I reverted back to the earlier version to check if the UnicodeDecode error still persists. Like I said earlier, it still does.
Jul 29 2016
The script extracts the articles listed under articles needing copy edit. It then computes their Flesch-Kincaid readability scores and fetches their page view counts. Each of the two measures is standardized, and the standardized values are added to get the final score. Articles with high page view counts and poor readability therefore rank highest. The top 20% are retrieved and made into questions.
Commit is here.
Related python scripts are:
copy_edit.py - main script that does the computations and generates questions
syllables_en.py - helper script to get syllables in a piece of text
utils.py - helper script to get words in a sentence, syllable count, sentence count, etc
copy_edit_ranking.pkl - generated pickle file of final article rankings. This is stored in the form of an ordered dict like:
dict[title] = [link, pageview, fk score, added standardized score]
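A compact sketch of the scoring described above, not a copy of copy_edit.py. It assumes the Flesch-Kincaid grade level (higher means harder to read) is the readability score, uses a rough vowel-group syllable count as a stand-in for syllables_en.py, and takes articles as hypothetical (title, link, pageviews, text) tuples.

```python
import re
from collections import OrderedDict
from statistics import mean, pstdev


def count_syllables(word):
    """Rough vowel-group heuristic; a stand-in for a real syllable counter."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fk_grade(text):
    """Flesch-Kincaid grade level of `text` (0.0 for empty text)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words) - 15.59)


def standardize(values):
    """Z-scores of `values`; all zeros if the values are constant."""
    if not values:
        return []
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def rank_articles(articles, top_fraction=0.2):
    """Return an ordered dict: title -> [link, pageview, fk score, added score]."""
    grades = [fk_grade(a[3]) for a in articles]
    views = [a[2] for a in articles]
    combined = [g + v for g, v in zip(standardize(grades), standardize(views))]
    order = sorted(range(len(articles)), key=lambda i: combined[i], reverse=True)
    keep = max(1, int(len(articles) * top_fraction))
    ranking = OrderedDict()
    for i in order[:keep]:
        title, link, pageviews, _ = articles[i]
        ranking[title] = [link, pageviews, grades[i], combined[i]]
    return ranking
```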
Jul 25 2016
Jul 19 2016
Jul 18 2016
@Aklapper , we were wondering how hyperlinking of urls is handled on Phabricator. It seems to catch most urls. I am looking for a regex pattern that's stronger than the one we are already using in our project. Could you direct me towards any useful resources in this regard?
Jul 17 2016
The script that accesses the API and displays page view count for the past month can be found here.
Run it as :
Labs instance has been up and running for a while : http://tools.wmflabs.org/arowf/
This task was for generating questions from WP:Backlogs. I have created a new task for the next task: T140559
Commit can be viewed here.
Jul 12 2016
Is there any way to get a list of all article names in Wikipedia? I need to test the script on a huge bulk of articles to find the ones containing 'recent' or 'recently'.
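For the matching step itself, a small sketch of a whole-word search for 'recent' or 'recently' in article text. The function name and the amount of context kept around each hit are assumptions.

```python
import re

# Matches 'recent' or 'recently' as whole words, in any letter case.
RECENT_RE = re.compile(r"\brecent(?:ly)?\b", re.IGNORECASE)


def find_recent(text, context=30):
    """Return a snippet of surrounding text for every match in `text`."""
    hits = []
    for match in RECENT_RE.finditer(text):
        start = max(0, match.start() - context)
        hits.append(text[start:match.end() + context])
    return hits
```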
Jul 11 2016
I'm asking around for help. Will let you know as soon as I get it running.
Jul 5 2016
Jul 4 2016
@Jsalsman , what else needs to be done here?
This task is done.
This task is done.
Jul 3 2016
I'll work on this task after a few others are completed.
My Labs request got accepted.
Jul 2 2016
BeautifulSoup is a python library that makes it very easy to extract data out of html.
I updated the script to reflect the changes. It now extracts articles from a given backlog category and month and makes questions out of all of them
@Jsalsman , is there anything else to be done to mark this task as complete?
Jun 30 2016
@Jsalsman Here's a simple script that scans a backlog page and outputs all article titles along with links on that page.
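A rough idea of what such a scan involves, written with the standard library's html.parser instead of BeautifulSoup so it runs without extra installs. The class and function names are hypothetical, and the filter assumes article links look like /wiki/Name with a title attribute; it also drops any article whose title contains a colon.

```python
from html.parser import HTMLParser


class ArticleLinkParser(HTMLParser):
    """Collect (title, url) pairs for article links in a category page."""

    def __init__(self, base="https://en.wikipedia.org"):
        super().__init__()
        self.base = base
        self.articles = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href") or ""
        title = attrs.get("title")
        # Assumption: links with a colon point to other namespaces
        # (Category:, Special:, ...) and are skipped.
        if title and href.startswith("/wiki/") and ":" not in href:
            self.articles.append((title, self.base + href))


def extract_articles(html):
    """Return all article (title, url) pairs found in `html`."""
    parser = ArticleLinkParser()
    parser.feed(html)
    return parser.articles
```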
Just to be sure, why do we need the script to be run via ask()? Ask can just handle user-typed inputs and render them in iframes.
Can't we just run the script externally, so that queries directly form question files in the records directory so that they can cater to answer()?
@Jsalsman , should I close this as resolved? This refactoring has long been done.
Jun 29 2016
I converted all dates to Julian dates (inspiration from here https://github.com/phn/jdcal) and found the mean. (I ignored the time though)
I am now working on finding the median in an efficient manner. A search suggested that https://en.wikipedia.org/wiki/Median_of_medians and https://en.wikipedia.org/wiki/Selection_algorithm work in O(n) time.
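A sketch of both steps under stated assumptions: dates are converted to Julian day numbers through Python's date ordinals (time of day ignored), and the median uses a randomized quickselect, which is O(n) on average. Median of medians would make that bound worst-case at the cost of more code. Function names are hypothetical.

```python
import random
from datetime import date

# Julian day number of the day before date ordinal 1, at midnight.
JD_OFFSET = 1721424.5


def to_julian(d):
    """Julian day number of date `d` at 00:00."""
    return d.toordinal() + JD_OFFSET


def from_julian(jd):
    """Calendar date for a Julian day number, rounded to a whole day."""
    return date.fromordinal(int(round(jd - JD_OFFSET)))


def mean_date(dates):
    """Mean of a non-empty list of dates."""
    return from_julian(sum(to_julian(d) for d in dates) / len(dates))


def quickselect(values, k):
    """Return the k-th smallest element (0-based) of `values`."""
    values = list(values)
    while True:
        pivot = random.choice(values)
        lows = [v for v in values if v < pivot]
        equal = [v for v in values if v == pivot]
        if k < len(lows):
            values = lows
        elif k < len(lows) + len(equal):
            return pivot
        else:
            k -= len(lows) + len(equal)
            values = [v for v in values if v > pivot]


def median_date(dates):
    """Median of a non-empty list of dates (lower median for even counts)."""
    jds = [to_julian(d) for d in dates]
    return from_julian(quickselect(jds, (len(jds) - 1) // 2))
```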
Completely shifted over to file based data management.
@Jsalsman wrote: we aren't going to require user registration or logins. Reviewers participating in reputation management systems will sign their review comments with a unique identifier. The failure mode of impersonation can be overcome by having that unique identifier be, include, or point to email to which confirmation is sent.
Jun 27 2016
The wikiwho code now outputs the time for each revid as well (in json format)
Check this particular output file for an article called 'Kendhoo'
Jun 26 2016
Hadn't we decided that I'll work on this once the review system is ready? Right now we are working on tasks that fall in the July 4 - Aug 3 category in the schedule here T129536. So 'obtaining article dumps' and 'manual list based input' have been pushed until after whatever we're working on now.
Jun 23 2016
@Jsalsman This design makes a lot of sense in vlermv! I will start coding this. I will get the review system completed before the password and username part that you mentioned in a later email.
Jun 22 2016
Jun 14 2016
Are the links provided the only place to learn more about this database? I searched the web but couldn't find any other sites for reference, including stackoverflow. Although the db might be good enough for all the points that you listed, I'm afraid there isn't an active developer community built around it. Where do I look if I run into errors? And what is the guarantee that this db will have support and maintenance in the future? Or is there good enough community support that I'm not aware of? Experimenting with software that is at a nascent stage for something as big as a database might be a little too risky. Your thoughts?
Jun 13 2016
Could we not move from SQLite to MySQL? The error is only due to a limitation of SQLite, which does not carry over to MySQL.
How do you feel about switching to NoSQL? I can see that will solve a huge number of your schema problems. Please let me know your thoughts on http://stackoverflow.com/a/2581859
Which format of NoSQL would be suitable? Document, columnar, key-value or graph type? And why?
I was going over the use cases of SQL and NoSQL. Apparently, NoSQL is really helpful for scaling out and if we have a lot of data. SQL on the other hand provides more consistency and makes it easier to carry out indexed queries. Please see http://stackoverflow.com/questions/2559411/sql-mysql-vs-nosql-couchdb
Please reply here with commented CREATE TABLE statements for the SQL schema you have so far. Moving to NoSQL will solve your schema upgrade problem. I will help you refactor to NoSQL. I promise I will put in as many hours as that takes this week and/or next.