
Internet in a Box Enhancement Top Level Project
Closed, ResolvedPublic

Description

NOTE: This project has received fewer contributions and is actively seeking new contributors.

Internet in a Box is a platform to provide offline access to WMF wikis and other content and applications. It is used by Wiki-Project Med as an information appliance for Healthcare Providers.

A list of desired technical enhancements is given at IIAB Enhancements

There are a variety of skills required, but the most important are Python, JavaScript, and HTML5.

Project information is here.

Or view the FAQ.

Event Timeline

What does "Top Level Project" mean? It is unclear to me what this task is about specifically, as it only links to lists instead of defining a specific task to perform.

@Tim-moody Hi! As you want to mentor the project via GSoC, ideally the format of this task adheres to the one mentioned here: https://phabricator.wikimedia.org/tag/outreach-programs-projects/ :) Let us know if you've questions/ concerns!

@Tim-moody Do you have a co-mentor in mind for this project? Ideally, every GSoC project has two mentors. If you need help in finding one, let us know!

I do not have a co-mentor and could use help in finding one. Thanks.

@Tim-moody I had a chat with @psinghal20 and he is interested in being a co-mentor for your project. @psinghal20 worked on the Education-Program-Dashboard as part of his GSoC project with Wikimedia in 2018 and mentored new developers on the same project in the Wikimedia's outreach programs and Hackathon last year. Also, was my co-administrator in the previous round.

I can imagine that some ways in which Pratyush could be of great help – answer prospective students' queries related to the project, help with setting up the development environment, give feedback on proposals, do code review, etc.

I will let you two decide on who can do what :)

Just wanted to follow-up on the previous comment :) @Tim-moody @psinghal20 Did you get a chance to have a conversation about @psinghal20 co-mentoring this project?

Also, I would need both project mentors email address shared with me (ssethi@wikimedia.org) so that I can send an invite to sign up as a mentor on the GSoC site. Looking forward to your response @Tim-moody and @psinghal20 :)

I came in through the GSoC portal, and this piqued my interest! However, I am a bit confused. I understand that the main objective is to have an IIAB implementation to help serve specific resources from Wikimedia. What I don't understand is: is the list of desired enhancements supposed to be our goal to implement? They seem to reference previously implemented instances too. It really feels a bit vague to me.

@BlaineSensei it was my hope that the rather long list of potential projects at https://meta.wikimedia.org/wiki/WikiProject_Med/Tech#Internet-in-a-Box would be a starting point for further discussion. If there is an item on the list that strikes your fancy, I would be happy to fill in more details. The Internet in a Box project itself is not new, but each of the enhancements listed represents a potential project.

Hello @Tim-moody, let's say I wish to pursue the WikEM project. How would I go about that?

The reason why I am asking you this is because although the links given in the task description do suggest great project ideas, there is no idea about the steps and scope of the project. If you could throw some light on that, it would be great.

@Chtnnh WikEM is one of three scraping projects, the other two being cdc.gov and nih.gov. All are similar, though my preference at this time would be NIH, as it has not been done before and Kiwix has a ZIM of WikEM. The goal of the project would be to produce a static HTML directory, in which all links are relative, that can be served by a web server. Ideally this would be done with few or no external links that require an internet connection, as Internet in a Box is meant to be offline.

There are a number of tools that can be used to scrape a site, such as wget, HTTrack, custom Python with Beautiful Soup, or perhaps Puppeteer. Each has tradeoffs, and familiarity with one is a strong recommendation. You need to download pages and assets, but mostly you will need to reduce the size of the result. The CDC site is close to 1 TB, but I reduced it to 8 GB by skipping videos and PDFs and compressing images. You will often need to use successive approximation as you decide what can be included and what cannot.
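The skip-and-compress pass described above can be sketched as a small URL classifier. This is a hypothetical helper, not part of any IIAB tool; the extension lists are illustrative, not taken from the actual CDC scrape:

```python
from os.path import splitext
from urllib.parse import urlparse

# Illustrative buckets: drop heavy media outright, recompress images,
# and keep everything else (pages, CSS, JS) as-is.
SKIP = {".mp4", ".mov", ".avi", ".webm", ".pdf"}
RECOMPRESS = {".jpg", ".jpeg", ".png", ".gif"}

def classify(url):
    """Return 'skip', 'recompress', or 'keep' for an asset URL."""
    ext = splitext(urlparse(url).path)[1].lower()
    if ext in SKIP:
        return "skip"
    if ext in RECOMPRESS:
        return "recompress"
    return "keep"
```

A downloader would consult this before fetching each asset, which is where most of the 1 TB to 8 GB reduction Tim describes would come from.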

In addition to managing size constraints, there are links to fix up, and often CSS or JavaScript needs fixing for the offline environment.

Finally, the biggest challenge is to make a repeatable process that can be automated in order to scrape the site periodically as it changes. Our scrape of the CDC is now almost three years old.

Thank you so much for the explanation Tim! Speaking of the three web scraping projects in general, I understand that we have a tight space constraint as the IIAB is meant to be offline.

I have a few follow up questions if you don't mind.

  1. Is the web server in question the IIAB?
  2. Once scraped, can we not implement standard CSS throughout IIAB?
  3. What is the amount of JavaScript employed in such pages typically?
  4. How does a contributor decide what content is more valuable in situations of memory constraints?

I also have a few question regarding the other projects mentioned.

"Add search infrastructure to IIAB": Is there no search functionality offered in IIAB currently? If yes, how do users typically navigate through content? If no, then what is the requirement of this project?

"Select and implement alternative usage tracking": What is the current tracking mechanism in place? What is the data analysis used for? Why are we seeking an improvement?

P.S.: I love the concept of IIAB! How many such units have been dispatched to date?

@Chtnnh 1. The web server is nginx, which is a standard component of IIAB.

  2. Keep in mind that IIAB is an integration more than an application, so we are dependent on the CSS and JS of others.
  3. For the CDC my memory is that it was not extensive, but it varies by content source.
  4. There are a number of MDs, and the WikiProjectMed, who guide us as to what content to include.

For the reasons in 2, there is no general-purpose search. Kiwix has search for wikis et al., and a few content modules have built-in search, but there is no search that spans all content.

There has been a lot of interest in search, but the complexity and effort are often underestimated. Kiwix has search, but many web-type modules do not. So the first question is whether to merge search results from various sources or reindex all content. Also, content is not static: each implementer can choose from hundreds of items. So the workflow of provisioning content needs to be considered.

awstats is the package used to track usage. There is interest in a more extensive package, but none has been selected. Suggestions are welcome.

I think 500+ have been distributed to date in two varieties, educational and medical. There have been a number of pilots that could lead to broader adoption.

@Tim-moody

Thank you so much on your extensive clarification. It has helped me understand the project idea much better. I am quite interested in helping scrape NIH content to be made available in the IIAB.

I see that the search implementation project is quite challenging.

I am afraid I won't be able to suggest any alternatives to awstats.

That's a good number! I see this is quite an amazing offering.

@Chtnnh If you would like to pursue scraping NIH, I think the next step would be to estimate the feasibility of producing an offline version of a reasonable size, say 3 to 5 GB. One problem is that we don't know how big a site is without downloading it, and it could end up being huge.

So it would help to have a general purpose tool that can:

  1. count total pages within some scope (spidering will tend to run off the site into other sites)
  2. count the number of assets of different types (pdf, image, video, etc) and determine their size if this is possible without downloading
  3. estimate the complexity by determining the extent of javascript or other scripting usage

It may also help to strategize around the presence or absence of an API. For example, a site based on MediaWiki will have a way to extract pages.
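The first two points above can be sketched as a pair of helpers that a crawler would call for each discovered URL. This is a minimal sketch of a tool that does not yet exist; the function names and type buckets are hypothetical, and asset sizes could then be fetched with HTTP HEAD requests where the server reports Content-Length:

```python
from collections import Counter
from os.path import splitext
from urllib.parse import urlparse

# Illustrative mapping of extensions to asset types for the tally.
ASSET_TYPES = {
    ".pdf": "pdf", ".jpg": "image", ".jpeg": "image", ".png": "image",
    ".gif": "image", ".mp4": "video", ".webm": "video",
}

def in_scope(url, root):
    """True if url stays on the same host and under the root path,
    so spidering does not run off the site into other sites."""
    u, r = urlparse(url), urlparse(root)
    return u.netloc == r.netloc and u.path.startswith(r.path)

def classify_asset(url):
    """Bucket an asset URL by extension ('page' for anything else)."""
    return ASSET_TYPES.get(splitext(urlparse(url).path)[1].lower(), "page")

def tally(urls):
    """Count assets of each type over a crawled list of URLs."""
    return Counter(classify_asset(u) for u in urls)
```

The counts from `tally`, together with per-type average sizes, would give a rough size estimate before committing to a full download.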

Any other ideas you have are welcome here.

Hello @Tim-moody

My name is Swapnil. I'm a sophomore studying Computer Science at BIT Mesra, India. I'm afraid I'm a bit late in finding this project, but I find the concept of Internet in a Box really fascinating. It can prove to be a huge help, especially to education and medical communities.

I have gone through the FAQs of the project and also through the list of desired enhancements. I am particularly interested in the project titled "Wikidata Integration", where we are supposed to create overlays of locally important features on OSM maps.

I understand that there are two main parts to this project:

  1. Getting data from Wikidata
  2. Creating overlays on OSM maps.

Regarding 1, I was thinking of the various ways of extracting data, like the MediaWiki API or using SPARQL endpoints. And regarding OSM maps, the Overlay API provided by OSM would be the place to go.

Would that be the correct approach?

I would be really thankful if you could help me understand it better.

P.S. Sorry for being late to the party.

@swapnil-sinha, I think this would make a great project, but it will be far from easy.

  1. Yes SPARQL is the best way to get data
  2. Yes, but IIAB already has OSM functionality so you will have to make sure that the Overlay API works with our code or modify our code to accommodate it.
  3. Keep in mind that IIAB is meant to be offline, so wikidata will need to be extracted prior to deployment and stored locally on the server.
  4. The Admin Console has functionality to provision various types of content during the time that an implementer has internet access, and this would become another type of content to include.
  5. To become familiar with what others have done there is a great tutorial at https://wikimania.wikimedia.org/wiki/2019:Libraries/Map_making_workshop_%E2%80%93_from_Wikidata_to_interactive_off-wiki_maps_in_three_steps
  6. You should also become familiar with the IIAB code at https://github.com/iiab
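As an illustration of point 1, a Wikidata SPARQL query for hospitals near a point might look like the sketch below. Q16917 (hospital) and P625 (coordinate location) are standard Wikidata identifiers, but the center, radius, and the `parse_bindings` helper are illustrative; actually running the query requires the live endpoint at https://query.wikidata.org/sparql, so only the parsing step is exercised here:

```python
# Hypothetical sketch: find hospitals within 50 km of a point
# using the Wikidata Query Service's "around" search.
QUERY = """
SELECT ?item ?itemLabel ?coord WHERE {
  SERVICE wikibase:around {
    ?item wdt:P625 ?coord .
    bd:serviceParam wikibase:center "Point(77.2 28.6)"^^geo:wktLiteral .
    bd:serviceParam wikibase:radius "50" .
  }
  ?item wdt:P31/wdt:P279* wd:Q16917 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

def parse_bindings(sparql_json):
    """Flatten SPARQL JSON results into (label, WKT point) pairs."""
    return [
        (b["itemLabel"]["value"], b["coord"]["value"])
        for b in sparql_json["results"]["bindings"]
    ]
```

Per point 3, the JSON response would be fetched once while online and stored locally on the server for offline use.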

Hi @Tim-moody, yes, by no means is it an easy project.

Thanks a lot for that explanation!

I went through the tutorial slides mentioned in 5, and saw the project code mentioned in 6.

Since there are no micro tasks given, how should I start with the project?

Is there a feature implementation that I should do on my own, or should I write a proposal right away? I was looking forward to some beginner friendly tasks to get adept with the project.

What's your take on this?

Thanks
Swapnil

@swapnil-sinha, as a first task for familiarization and to verify that this is a doable project, I suggest that you create a prototype in which you take the Libraries in the Netherlands example from the tutorial and modify it to show hospitals within say 50 km of where you live (or any other place not in Netherlands). Please document any problems that would prevent this project from being completed.

Okay @Tim-moody . I'll start working on it ASAP.

Thank you for your time.

@Tim-moody, I also want to compete for this problem.

What test task can I do?

@Cheptil we won't be able to have two people working on the wikidata OSM project. So you might look at the rest of the list of enhancements to see if there is something that interests you, otherwise you can undertake the same task if you really want to make this a competitive situation.

@Tim-moody, I looked through the enhancement list, and the "Content Search" task block is the most attractive to me, because I specialize in information retrieval (my academic micro-CV is in the gsoc20-outreachy20 Zulip stream).

I focus on semantic search, and my idea for the "Create search indices for commonly used content" subtask is to number objects in such a way that objects with close numbers have close meanings. Such renumbering can speed up the inverted index compared with randomly numbered objects (as in a classical inverted index).

So I can write a proposal for this idea and its realization, or first perform a microtask to confirm my intent in this direction.
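For contrast, the classical inverted index mentioned above can be sketched in a few lines; this is a toy illustration, not IIAB code. The renumbering idea would assign IDs so that semantically similar documents get nearby numbers, making the postings lists more compressible and faster to scan:

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the sorted list of document IDs containing it.
    docs is a {doc_id: text} dict; IDs here are arbitrary."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, *terms):
    """Documents containing all the given terms (boolean AND)."""
    postings = [set(index.get(t, ())) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []
```

Merging such per-module indices versus reindexing everything is exactly the design question raised earlier in the thread.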

Hi @Tim-moody ,

I just made a Wikidata set for hospitals within a 50 km radius of Delhi, the capital of India, where I live. Here is the link to the query:

https://w.wiki/LLv

It is a basic flat map that returns the names and coordinates of the hospitals. I'm currently working on module 3, making interactive off-wiki maps and returning the result in an HTML file.

Please tell me what else I should do to proceed with the project. It is already getting more interesting for me than it was earlier.

I haven't encountered any problems as such yet.

Looking forward to hearing from you
Swapnil

@swapnil-sinha sounds like you are making progress. Can you get the labels onto the hospitals? How will you store the Wikidata for offline use? Keep in mind that this needs to be an overlay, as IIAB has OSM maps pre-rendered. I chose one hospital at random (Mata Cana Devi) and compared Google and OSM. It is disappointing that the street names are different. (Assuming that I navigated to the same place.)

image.png (390×624 px, 354 KB)
image.png (390×624 px, 245 KB)

@Cheptil, there has been a lot of interest in search, but the complexity and effort are often underestimated. Kiwix (an IIAB component) has search, but many web-type modules do not. So the first question is whether to merge search results from various sources or reindex all content, as the latter results in duplication of storage for Kiwix content. Also, content is not static: each implementer can choose from hundreds of items. So the workflow of provisioning content needs to be considered, as indexing will need to happen on a server-by-server basis. Also keep in mind that all of this has to run on a Raspberry Pi, so both performance and storage (typically 128 GB) need to be considered.

Thanks a lot @Tim-moody

Yes, I am on the labels thing.

I'm sorry, but I didn't understand the point about different streets. The hospital looks the same to me on both maps. The street beside the hospital is Lal Sai Mandir Marg on both maps. Am I missing something?

Screenshot_2020-03-27 Wikidata Query Service.png (745×1 px, 454 KB)

Screenshot_2020-03-27 77°04'41 6 N 28°37'07 9 E.png (757×934 px, 173 KB)

About the data being stored offline, I was thinking of JSON dumps and for the overlay on existing OSM maps, I was reading about OpenLayers JS library. I could really use some help here.

Thanks
Swapnil

@swapnil-sinha, I was referring to the cross street, which is Bhagwan Mahavir in OSM and Pankha in Google (obviously OSM's fault, not yours). In terms of overlays, we have never done this, so for now let's set the expectation that you will need to study up and then get feedback. You will need to play a design role as well as an implementation role. I can't promise more than that now, but I am working on it.

Okay @Tim-moody .
I've started with the necessary reading.
Can I just know what technology is used to get the OSM maps offline? Is there a complete dump, or something else?
I was hoping to build up on that information for my study.

You will need to play a design role as well as an implementation role.

Yeah, the project indeed is interesting , I'll try to do the best I can.

@swapnil-sinha, here is the git repo https://github.com/iiab/maps

You will eventually want to look at the other repos as well, iiab where osm gets installed and iiab-admin-console where map sets get installed.

Hi @swapnil-sinha,
I'm a friend of @Tim-moody and have done work on the IIAB maps.

The main.js at https://github.com/georgejhunt/maps/tree/test/osm-source/viewer has a "drag and drop" interaction which would permit you to drop a geojson file on a vector map.

I found setting up a development environment for openlayers to be very challenging. The setup I use is at https://github.com/georgejhunt/maps/tree/test/generate-regions/pack (it uses webpack -- check the packages.json for dependencies).

More recently (I first started learning openlayers a year ago), the openlayers docs suggest using parcel.

It took me a while to realize that the documentation describes javascript version ES6. But most browsers only understand ES5, and require a compilation step with webpack or parcel, etc.

I suggest you download the vector map for your area of interest from https://openmaptiles.com/downloads/planet/.

It probably will be a little challenging to put all of the pieces in place from the repos I have referenced. The developers of IIAB use ansible to configure the software for the Raspberry Pi. The ansible play that installs the OSM viewer is at https://github.com/georgejhunt/iiab/tree/3map-reorg/roles/osm-vector-maps.

I'm sure you will have questions. I'll be glad to help.

@Tim-moody, sorry, I didn't realize that this direction was so complex.

I'll try to find something simpler.

Thank you for your time


Thank you so much for the explanation, @Georgejhunt!
I think I am finally getting to understand the basic layout. (There's still a whole bunch left, though.)
It feels a bit intimidating, especially as the submission deadline is fast approaching, and I really want to be a part of this project.

After spending almost all my waking hours on this for the past couple of days, I now understand how the system basically works. The Admin Console lists all the region-wise map data, which gets downloaded to the system as an .mbtiles file. This file comes from OpenMapTiles, which provides maps in vector format, as they prove to be better than bitmaps. These mbtiles (after being converted from the .pbf format) are served by a PHP server to the client. On the client side, we use the OpenLayers JS library to render the maps. All this is explained in a fantastic manner at:
https://github.com/iiab/iiab-factory/blob/master/content/vector-tiles/Design-Decisions.md

What we need now are overlays to be shown. Here are my two cents on that:

Firstly, to generate overlays, I think we can use the vector layer functionality provided by OpenLayers (https://openlayers.org/en/latest/examples/vector-layer.html). We can put our Wikidata query result as a TSV file on geojson.io and then get a GeoJSON file, which can be put on the console page. The implementer can download the file for whichever overlay he wants (hospitals, schools, or any other), along with the .mbtiles file, to his machine. The GeoJSON file will be stored in a specific 'data' directory for OpenLayers to access, and from the PHP server we can retrieve the overlays (as layers) on the OpenLayers page.

The big takeaways from this, in my view, would be generating a GeoJSON file for every feature and then putting them on the Admin Console page. Also, I still have to explore how to download the files to a specific directory and feed them to the PHP server. This, I think, can be done with a little more tinkering on my side.

I would be highly grateful, @Tim-moody and @Georgejhunt , if I could get your valuable feedback.

Thanks

P.S.: I think we should still compile down to ES5 with webpack for some more time, since we cannot be sure which browsers our users will have, considering the conditions in which they may be using this. So backward compatibility should be kept in mind, in my honest opinion.

P.S.: With each passing day, as I understand more of the details, the project is looking a lot more interesting than before.

@Tim-moody , @Georgejhunt ,
I am still unclear on the Automation and Testing part, done with Ansible. I still have to look into that.

@swapnil-sinha, your summary sounds about right. I'd like to hear what @Georgejhunt has to say, but I think the approach will work.

I think a catalog of pre-calculated overlays, perhaps stored in geojson.io, is a good first pass. (I'm not sure why you chose a .tsv file format rather than json.) Admin Console typically has a catalog of items of a particular type that it renders as a list in order for implementers to make selections. We would need a catalog of urls to pre-calculated overlays, which might need to live outside of geojson.io. We already have some at http://medbox.iiab.me/catalog/.

However, given the large number of combinations of geographical area and features that could be extracted from wikidata, I would like implementers to have a workflow that would allow them to do their own queries and include the results. This might involve external tools such as geojson.io and a clean way to download a json file using Admin Console, which we typically use to hide the details of where things go on the server and how to get them.

I wouldn't worry too much about ansible at this point. Adding this functionality to the playbook will not be difficult or time-consuming.

How does this project feel to you in terms of your skill set and experience? Is it doable? I think it could be a valuable contribution.

@Tim-moody , thanks for your valuable feedback.

Yes, giving the implementers the option to make their own queries would be amazing and would make the entire thing more interactive and user-friendly. I'll see what I can do about that.

As I earlier mentioned, with every Google search about a resource, I am getting more involved in the project and all the more interested.

I have limited experience with Webpack, but from the guides I saw today, I'm now more confident that I can do it on an individual level after some practice.

With everything else, things look really well and I'm really excited to begin working on the project.

The last couple of days have truly been an experience, and I can't wait to prove valuable to Internet in a Box and the Wikimedia Foundation in general.

Would love to hear your thoughts.

I'm new to the project, and don't have much sense of time frame. But I think simple is better.

I discovered that OpenLayers has a function called "interaction" which provides a very simple drag-and-drop method for adding data points to a map. The next version of IIAB maps has this drag-and-drop function enabled by default. So the simplest thing, I think, is to create the data points in a format that is accepted by the drag-and-drop interaction. I suggest GeoJSON because there is a Python library, which I have used, that makes formatting GeoJSON really easy.
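The GeoJSON described above can also be produced with just the standard library. The sketch below is a hypothetical helper, not IIAB code; it builds a FeatureCollection of points, with the label riding along in `properties` so a popup can display it:

```python
import json

def points_to_geojson(points):
    """Build a GeoJSON FeatureCollection from (label, lon, lat) tuples.
    Note that GeoJSON coordinate order is [longitude, latitude]."""
    return {
        "type": "FeatureCollection",
        "features": [
            {
                "type": "Feature",
                "geometry": {"type": "Point", "coordinates": [lon, lat]},
                "properties": {"name": label},
            }
            for label, lon, lat in points
        ],
    }

# json.dumps(points_to_geojson(...)) saved to a .geojson file could
# then be dropped onto the map via the drag-and-drop interaction.
```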

The map system that IIAB is currently shipping involves a single self-contained zip file which gets downloaded via the Admin Console for the selected region. That will change in the next release, such that the viewer and all of the referenced assets are installed during IIAB initialization, and the only choice made in the console is which vector tileset to download (a <region name>.mbtiles).

@swapnil-sinha do you have access to an RPi? I think the easiest path would be for you to install IIAB from my development branch, which already has the drag-and-drop working. This can be the test bed for output from Wikidata.

Regarding getting webpack or Parcel development working: I found that I needed to follow the OpenLayers tutorial and start really simple. Once you can get a single simple vector map going, I think it will be easy to subset the main.js in https://github.com/georgejhunt/maps/tree/test/osm-source/viewer to get just the map and the interaction going.

I think the challenge will be to develop a tool that facilitates selection of wikidata, and formatting it into geojson. The openlayers tutorials have examples of how to create a popup to display the data which rides along with each data point. If there's a python interface to wikidata, I'd want to explore that.

@swapnil-sinha and @Georgejhunt This all sounds good. I think we should focus on a working prototype and work on integrations (Admin Console) and optimizations (Webpack) later. Since this is incremental functionality, I think webpack may turn out to be too disruptive to existing code.

@swapnil-sinha, what else do you need in order to propose a project?

Thank you @Georgejhunt for the feedback.

@Tim-moody , I've gone through some previous accepted proposal formats. Should I start drafting a proposal based on all that we've discussed so far?

Hi @Tim-moody ,
Should I submit a draft of my proposal as a comment, or as a new task?

@swapnil-sinha, I would prefer a separate task. Perhaps @srishakatux or @psinghal20 could comment on standards.

@Tim-moody , @psinghal20 , @Georgejhunt I've added the first draft of my proposal at T248800 . Kindly have a look.

Thanks

@Tim-moody, is it possible to set a time when we can come together to discuss the project? The time zones are really doing their thing here.

Is everything in this project task planned for Google-Summer-of-Code (2020) completed? If yes, please consider closing this task as resolved. If bits and pieces are remaining, you could consider creating a new task and moving them there.

@Tim-moody: Could you please answer the last comment? Thanks in advance!