
Outreachy 32: Addressing the lusophone technological wishlist proposals - Create a Python script to get and print the `status code` of the response of a list of URLs from a `.csv` file.
Open, Needs Triage, Public

Description

IMPORTANT: Make sure to read the Outreachy participant instructions and communication guidelines thoroughly before commenting on this task. This space is for project-specific questions, so avoid asking questions about getting started, setting up Gerrit, etc. When in doubt, ask your question on Zulip first!

This is the second task for T418284: Outreachy 32: Addressing the lusophone technological wishlist proposals

Objective of the task: Create a Python script to get and print the status code of the response of a list of URLs from a .csv file.
Steps:

  1. You should create a GitHub account, if you don't already have one. You can do so at https://github.com/signup.
  2. You should download this CSV input file:
  3. Based on the input provided, write your Python code.
  4. Your Python code needs to get the URLs from the file and print their status code in the following format (a rough illustrative sketch follows these steps):
    1. (STATUS CODE) URL
    2. e.g. (200) https://www.nytimes.com/1999/07/04/sports/women-s-world-cup-sissi-of-brazil-has-right-stuff-with-left-foot.html
  5. Commit your Python file to GitHub and send us a link to it by email to tecnologia AT wmnobrasil.org with subject [Outreachy] <your username>
  6. Make sure to also register it as a contribution on the Outreachy website!
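
As a rough illustration only (not part of the required submission), a minimal sketch of such a script might look like the following. It assumes the CSV lists one URL per row in its first column and that the third-party requests library is installed; the real input file may be laid out differently.

import csv
import requests

def print_status_codes(csv_path):
    # Read URLs from the CSV and print "(STATUS CODE) URL" for each one.
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if not row or not row[0].strip().startswith("http"):
                continue  # skip blank rows and a possible header row
            url = row[0].strip()
            response = requests.get(url, timeout=10)
            print(f"({response.status_code}) {url}")

if __name__ == "__main__":
    print_status_codes("urls.csv")  # hypothetical filename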

Note: applicants who submitted both tasks by Monday, April 6th, 4pm UTC, will receive feedback by Friday, April 10th, so that they can improve their application. The project is closed to new applications.

Event Timeline

Arcstur updated the task description.
LGoto renamed this task from Outreachy 32: Addressing the lusophone technological wishlist proposals - Task 2 to Outreachy 32: Addressing the lusophone technological wishlist proposals - Create a Python script to get and print the `status code` of the response of a list of URLs from a `.csv` file..Feb 27 2026, 9:10 PM

Hello!
I'm an Outreachy applicant and I've been looking into the codebase to prepare for the contribution period.
I have set up my environment and am currently researching the T418286 task. I’m planning to submit my first patch as soon as the contribution period officially opens on March 16th.
Thanks!

Tishly78 subscribed.

Hello Arcstur. I am an Outreachy applicant and would like this task to be assigned to me.

Hello @Arcstur and @Ederporto,

I have completed both microtasks (T418285 and T418286) and submitted them via email.

I really enjoyed working on the tasks, especially handling real-world cases like URL validation and error handling.

Please let me know if any improvements are needed. I would be happy to refine my work further.

Thank you!

Hello @Arcstur and @Ederporto

I have completed the microtasks and submitted them via email. I am looking forward to your feedback.

Thank you

Good afternoon, @Arcstur and @Ederporto.

I have submitted the microtasks and am looking forward to getting your input.

Hello @Arcstur and @Ederporto. I have completed my microtasks and submitted them via email. I look forward to hearing back.

Hello @Arcstur, @Ederporto. I have completed the second task (T418286) as well and emailed the link. Hoping to get feedback soon! In the meantime, can you provide pointers on what we should do next?
Thank you!

Good morning @Arcstur @Ederporto, I have completed both tasks and am patiently waiting for feedback.

Good day @Arcstur and @Ederporto

I have completed the second task T418286. I am looking forward to working with the community. Thank you.

Hello @Arcstur @Ederporto , I have completed both tasks. Awaiting your feedback.

Hello @Arcstur @Ederporto
I have completed both tasks and emailed them. Looking forward to your feedback.

Hello @Arcstur @Ederporto

I’ve completed the initial microtasks and submitted them via email, and also shared my work on GitHub.

I’m ready to continue and would appreciate any guidance on the next tasks. Thanks!

Hello @Arcstur @Ederporto
I’ve completed the initial microtasks and submitted them via email, and also shared my work on GitHub.
Looking forward to your feedback.

Hello @Arcstur @Ederporto
My name is Halima Muhammad Muktar, an Outreachy applicant interested in this project.
I’m currently setting up my environment to start working on the microtasks (T418285 and T418286). I will begin with the Python task (T418286) shortly.
I’m very eager to contribute and learn, and I’ll keep you updated on my progress.
Thank you!

@Arcstur @Ederporto
Hello mentors,

I completed microtask T418286. My Python script reads URLs from a CSV file and checks their HTTP status codes using the requests library.

Here is the Replit project link: https://replit.com/@halimamuktar/outreachy-t418286#main.py

Please let me know if you have feedback or suggestions for improvement.
Thank you!

Hello @Arcstur @Ederporto,

My name is Azeezat Oladunni, an Outreachy applicant contributing to the Addressing the lusophone technological wishlist proposals project. I have completed both microtasks (T418285 and T418286) and submitted them via email. I have also registered them as contributions on the Outreachy website.

I would really appreciate any feedback when you have the time.

Thank you.

Hello mentors,

I have completed this microtask and submitted my solution via email yesterday.
Looking forward to your feedback whenever convenient.

Thank you!

I have just completed task 2. I recommend running your script against the full CSV before submitting, because some URLs behave in unexpected ways and it's worth seeing how your code handles them.

I liked figuring out how to handle the dead links without crashing the script. I'm assuming this kind of URL handling logic is similar to what we'll need for Wishlist #3 when the Visual Editor checks if a reference URL has already been used in an article... Looking forward to the next steps.

Hi,

Thank you for the feedback! I have now tested my script on the full CSV dataset provided in the task. I improved handling for edge cases such as timeouts, invalid URLs, and connection errors so that the script runs without crashing.

Additionally, I updated the script to store the results in a CSV file for better usability.

I would appreciate any further feedback when you have time.
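
For anyone else hardening their script, here is a hedged sketch of that kind of error handling plus CSV output. The exception choices and output columns are my own assumptions for illustration, not the exact script described above.

import csv
import requests

def check_urls(input_csv, output_csv):
    # Record either the HTTP status code or the error type for each URL.
    with open(input_csv, newline="", encoding="utf-8") as f:
        urls = [row[0].strip() for row in csv.reader(f)
                if row and row[0].strip().startswith("http")]

    results = []
    for url in urls:
        try:
            response = requests.get(url, timeout=10)
            results.append((url, response.status_code))
        except requests.exceptions.Timeout:
            results.append((url, "Timeout"))
        except requests.exceptions.ConnectionError:
            results.append((url, "ConnectionError"))
        except requests.exceptions.RequestException as exc:
            # Anything else requests can raise: invalid URL, too many redirects, ...
            results.append((url, type(exc).__name__))

    with open(output_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "status"])
        writer.writerows(results)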

I liked figuring out how to handle the dead links without crashing the script. I'm assuming this kind of URL handling logic is similar to what we'll need for Wishlist #3 when the Visual Editor checks if a reference URL has already been used in an article... Looking forward to the next steps.

You make a good point; the URL handling we built is related to what Wishlist #3 needs.
But duplicate detection is trickier than just comparing URLs, because different identifiers (DOI, URL, ISBN) can point to the same source.

For cases where the same identifier type is used twice (like two DOIs written in slightly different formats), simple cleaning and normalization can catch the duplicates. But when two or more different types of identifiers point to the same source, normalization isn't sufficient. We'd need to know what each one actually points to before we can call them duplicates.

So I think the approach for Wishlist #3 would need to work in layers: first normalize within each identifier type (DOI, ISBN, and URL each have their own cleaning rules), then handle cross-type matching.

It would be better to confirm with the mentors @Arcstur @Ederporto how far the cross-type matching needs to go.
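
To make the first of those layers concrete, here is a rough, hypothetical illustration of per-type normalization. The cleaning rules shown are simplified assumptions, not the rules Wishlist #3 would actually need.

from urllib.parse import urlsplit

def normalize_doi(doi):
    # DOIs are case-insensitive; strip common resolver prefixes.
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi

def normalize_isbn(isbn):
    # Drop hyphens and spaces so the same ISBN written differently still matches.
    return isbn.replace("-", "").replace(" ", "").strip().upper()

def normalize_url(url):
    # Lowercase scheme and host, drop the fragment and any trailing slash.
    parts = urlsplit(url.strip())
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{parts.path.rstrip('/')}"

# Two DOIs written differently normalize to the same key:
assert normalize_doi("https://doi.org/10.1000/XYZ123") == normalize_doi("doi:10.1000/xyz123")

Cross-type matching (for example, a DOI and a plain URL that resolve to the same paper) still needs a lookup against some external source of truth, which is exactly the part worth confirming with the mentors.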

Oh, that's a really nice catch; I hadn't thought about the cross-type matching problem. So two references could use completely different identifiers but still point to the same source. The layered approach you described makes sense.

Hi,

Thank you both for the detailed explanation. This clarifies the complexity around duplicate detection, especially when different identifier types like DOI, URL, and ISBN point to the same source.

The layered approach of normalizing within each identifier type and then handling cross-type matching makes a lot of sense. I will keep this in mind while thinking further about Wishlist #3.

Looking forward to exploring this further and contributing more.

Hello @Arcstur @Ederporto,

My name is Ndinae Khalushi. I have completed both microtasks (T418285 and T418286) and submitted them via email. I will also be submitting them as contributions on the Outreachy platform.

Thank you for your engagement.

Hello @Arcstur and @Ederporto, I'm kindly following up on this: "Note: applicants that submit both tasks until Monday, April 6th, 4pm UTC, will receive feedback so that they can improve their application".

Hello @Arcstur and @Ederporto and fellow contributors!

My name is Alexis. I am from Los Angeles, CA, and I am extremely excited to be a part of a project that assists the developer community in Brazil!

I have quite a few ideas that I am working on contributing while awaiting the Outreachy internship announcement for the May 2026 cohort!

  • In addition to the assigned microtasks.

I have begun working on the first microtask (Task 1), posting my initial commit two weeks ago. However, there is a small error: the date registers one day before the correct date from the source file
article-info.html

Nonetheless, I wanted to honor the April 6th, 4pm request for microtask submission to the best of my availability,
and I will submit an as-is file and summary email as requested, considering I am about 6 months into making open source contributions and front-end development as a completely self-taught learner.

  • View my Initial Commit

github.com/stx-pro/wikimedia

  • A .gitignore file and a Creative Commons licensing file, to plan for further iteration.

I will post more details about how I plan to contribute to the Project Overview

Contributing to Global & Local Digital Equity

My content pillars center around:

  1. Networking
    • Career Plans for Web Accessibility auditing
  2. Developing
    • Online Resource Navigation Personal Projects
  3. Advocating
    • The use of these Digital Assets in my Community

and others that face similar widespread systemic issues and poverty.

"Digital literacy is the ability to find, evaluate, and communicate information through various digital platforms." — NDIA

I am currently working on publishing valuable information, tools & resources.

Moving beyond "raw code" to understanding high-level systems architecture and digital entrepreneurship.

Alignment Methods

Grassroots Foundations in App Development

In a brief effort to demonstrate my front-end skills, here are a few deployments I have launched in the OneCompiler IDE sandbox.
Please feel free to check them out!

Extension Workshop - Mini-Compiler tool
assists with deployment of Chrome extensions via a Manifest V3 stack iteration
"EXT-CORD"

Drafted DOM/ASCII Guided Area
uses a Google Sheets sync concept to build tokenization elements
compatibility enabled with the Canva Design Suite via a webhook scenario
"CONFIG-ASCII"

Side Panel Notes with Storage Bin Options
Markdown Scratch Pad and Tree Diagram Modeling
"TOGGLE-TREE"

Back to the Wikimedia Projects

I found a free article via Codecademy that I believe will help me correct my error in Task 1:
Formatting Dates in JavaScript

Python Microtask Strategy
I am planning to read the source .csv file directly in the Python terminal, where I can import a data class specification module, create URL objects, and add properties for the current status code associated with each item in the list.
I am looking into using a similar method to correct my GitHub commit file.

To import the data classes module, I found this method on the Python Help Docs website listed in the terminal.
Instance method objects have attributes, too: m.__self__ is the instance object with the method m(), and m.__func__ is the function object corresponding to the method.


Python code that expects a particular abstract data type can often be passed a class that emulates the methods of that data type instead.


For instance, if you have a function that formats some data from a file object, you can define a class with methods read() and readline() that get the data from a string buffer instead, and pass it as an argument.

The descriptor model binds classification methods to instances, enabling the localized reading of the HTTP state (status_code).

View the library here if you find it helpful as well!

I will update these methods tonight in my GitHub repo and in upcoming blog updates.
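
If I understand the dataclass idea above correctly, a hypothetical sketch of it could look like this; the class name, file name, and field names are my own placeholders, not anything from the task.

import csv
from dataclasses import dataclass
from typing import Optional

import requests

@dataclass
class CheckedUrl:
    # One URL from the input file plus the status code observed for it.
    url: str
    status_code: Optional[int] = None

    def check(self):
        try:
            self.status_code = requests.get(self.url, timeout=10).status_code
        except requests.exceptions.RequestException:
            self.status_code = None  # dead link, timeout, malformed URL, ...
        return self

with open("urls.csv", newline="", encoding="utf-8") as f:  # hypothetical filename
    items = [CheckedUrl(row[0].strip()).check()
             for row in csv.reader(f) if row and row[0].strip().startswith("http")]

for item in items:
    print(f"({item.status_code}) {item.url}")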

Hello @Arcstur and @Ederporto,

Thank you for your feedback.

I have updated Task 2 based on your suggestions.

  • Replaced GET requests with HEAD requests for better efficiency
  • Added allow_redirects=True to properly handle redirects
  • Improved exception handling by including specific error types such as Timeout, ConnectionError, and InvalidURL
  • Included exception names in the output for better clarity
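
For other applicants following along, a small sketch of roughly what those changes amount to (the actual submitted code may differ):

import requests

def get_status(url):
    # Return the final status code after redirects, or the exception name on failure.
    try:
        # HEAD fetches only headers, so it is usually faster than GET;
        # allow_redirects=True follows 301/302 chains to the final destination.
        response = requests.head(url, allow_redirects=True, timeout=10)
        return response.status_code
    except requests.exceptions.Timeout:
        return "Timeout"
    except requests.exceptions.ConnectionError:
        return "ConnectionError"
    except requests.exceptions.InvalidURL:
        return "InvalidURL"
    except requests.exceptions.RequestException as exc:
        return type(exc).__name__

print(f"({get_status('https://www.wikipedia.org')}) https://www.wikipedia.org")

One caveat worth noting: some servers answer HEAD with 405 even though GET would succeed, so a GET fallback can still be useful.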

I have also fixed the issues mentioned in Task 1.

Repositories:

Thank you for your guidance.

Best regards,
Ayush Khati

Hey @Arcstur and @Ederporto,
Thank you for your feedback. I really appreciate it.
I have made all the suggested changes:
Task 1: It now respects different timezones and formats the date accordingly.
Task 2:

  • Replaced GET with HEAD requests to get the status code, increasing speed and efficiency.
  • Added allow_redirects=True to report the destination status instead of an intermediate response like 301.

Here is my repository link: https://github.com/vikashsiwach/outreachy-wikimedia-application

Thank you again for your feedback, and I welcome any further suggestions.

Hello everyone, feedback was sent to those who submitted their tasks up to Monday, April 6th, 4pm UTC. There won't be a second round of feedback. If you have not done so yet, remember to submit your final application on the Outreachy website before April 15th at 4pm UTC.