
Wikimedia Technical Conference 2019 Session: Self-service Stateless Microservices (for APIs)
Closed, Resolved, Public

Description

Session

  • Track: Deploying and Hosting
  • Topic: Self-service Stateless Microservices (for APIs)

Description

Right now any new service needs to go through a series of steps to get into production. These steps make sense for long-term projects, but not for experiments or small services that just transform data from one source and are not in the critical path to serve the wikis.

For such smaller projects, we could think of creating an even more streamlined system than our current one, basically allowing people to register lambda services with an API to run in production in a self-service fashion. Such a system would allow much faster experimentation and iteration, but would need to be well defined and scoped.

Questions to answer and discuss

Question: Is there broad interest in such a system?
Significance: Given engineering such a simplified system requires a sizeable amount of work, it only makes sense if there is widespread interest around the idea.

Question: What would be the scope of such self-service microservices?
Significance: Defining what scope such simplified services could be applied for is fundamental to understanding if we can strike a good balance between stability and soundness of the architecture and speed of development. If we're too strict in what we allow, the system might be seldom used. If we're too liberal, it might create a tower of babel no one has a comprehensive understanding of.

Question: Which parts of our usual process could be bypassed? What are the connected risks?
Significance: While some parts of our current process, like creating a new service request, a load-balanced endpoint, a deployment-chart, a pipeline-based project can probably be removed from the equation without harm, we might still want to have some basic security review of the software, or to have some architectural rubberstamp to the idea. We should discuss which parts of the current process we consider unavoidable even for experiments and small lambda functions.

Question: What should an interface to such a system look like to the user?
Significance: We need to understand what the potential users - the developers! - would find attractive as an interface to such a system. Maybe being able to point the system to a git repository and expect it to just work(TM) in a CD fashion? An API to upload files for callbacks?

Question: How does this compare to WMCS?
Significance: WMCS has been the de facto place for such experiments up to now. Which usages are destined for WMCS and which are not?

Related Issues

  • ...
  • ...

Pre-reading for all Participants

  • [add links here]

Notes document(s)

https://etherpad.wikimedia.org/p/WMTC19-T234646

Notes and Facilitation guidance

https://www.mediawiki.org/wiki/Wikimedia_Technical_Conference/2019/NotesandFacilitation


Session Leader(s)

  • Giuseppe Lavagetto

Session Scribes

Session Facilitator

  • Aubrey

Session Style / Format

  • [what type of format will this session be?]

Session Leaders please:

  • Add more details to this task description.
  • Coordinate any pre-event discussions (here on Phab, IRC, email, hangout, etc).
  • Outline the plan for discussing this topic at the event.
  • Optionally, include what this session will not try to solve.
  • Update this task with summaries of any pre-event discussions.
  • Include ways for people not attending to be involved in discussions before the event and afterwards.

Post-event summary:

  • ...

Post-event action items:

  • ...

Event Timeline

debt created this task. Oct 4 2019, 3:36 PM
TheDJ added a subscriber: TheDJ. Edited Oct 12 2019, 2:41 PM

I really like this idea. Being more flexible in experimentation and rapid deployment seems important to me.

I've personally thought about similar strategies for MediaWiki frontend, where you could have a combination of Special pages and Gadgets, published as 'apps' that only make use of JS/OOUI/API to experiment with specific features. These could be hosted in a managed (yet separate) system, and you would 'hotload' them within alpha/beta/production versions of MediaWiki.

A similar approach for service development with the lessons learned from AWS lambda and similar features seems logical to me. It also fits within the ideas of PAWS and Quarry I think (which have been very successful).

TK-999 added a subscriber: TK-999. Oct 12 2019, 3:44 PM

Something to consider would be using a monorepo for hosting such services. This setup can bring some benefits (less administrative overhead as provisioning a new service wouldn't require creating a new repository/package, easier code and dependency sharing between services, the ability to share CI/CD configuration for all services).

Joe awarded a token. Oct 14 2019, 10:49 AM
WMDE-leszek reassigned this task from WMDE-leszek to Joe. Oct 17 2019, 12:17 PM
Joe reassigned this task from Joe to WMDE-leszek. Oct 17 2019, 1:43 PM
Joe updated the task description.
Joe added a subscriber: Joe.

I'd add a couple of questions

Question: What are the risks involved in bypassing parts of the deployment process?
Significance: Ending up with PoCs serving even very small portions of production traffic (even if that is explicitly not allowed by this process) comes with security, scalability and reliability issues. Aka experimenting in production is generally a bad idea. How do we enforce

Question: How does this compare to WMCS?
Significance: WMCS has been the de facto place for this up to now. How do we communicate which usages are destined for WMCS and which are not?

Joe updated the task description. Oct 21 2019, 9:24 AM
WMDE-leszek reassigned this task from WMDE-leszek to Joe. Oct 22 2019, 6:25 PM
debt triaged this task as Medium priority. Oct 22 2019, 7:00 PM
greg added a comment. Oct 23 2019, 9:38 PM

(Programming note)

This session was accepted and will be scheduled.

Notes to the session leader

  • Please continue to scope this session and post the session's goals and main questions into the task description.
    • If your topic is too big for one session, work with your Program Committee contact to break it down even further.
    • Session descriptions need to be completely finalized by November 1, 2019.
  • Please build your session collaboratively!
    • You should consider breakout groups with report-backs, using posters / post-its to visualize thoughts and themes, or any other collaborative meeting method you like.
    • If you need to have any large group discussions they must be planned out, specific, and focused.
    • A brief summary of your session format will need to go in the associated Phabricator task.
    • Some ideas from the old WMF Team Practices Group.
  • If you have any pre-session suggested reading or any specific ideas that you would like your attendees to think about in advance of your session, please state that explicitly in your session’s task.
    • Please put this at the top of your Phabricator task under the label “Pre-reading for all Participants.”

Notes to those interested in attending this session

(or those wanting to engage before the event because they are not attending)

  • If the session leader is asking for feedback, please engage!
  • Please do any pre-session reading that the leader would like you to do.
debt updated the task description. Oct 25 2019, 9:28 PM
Nikerabbit updated the task description. Nov 4 2019, 2:17 PM
WMDE-leszek updated the task description. Nov 11 2019, 5:21 PM
WMDE-leszek added a subscriber: WDoranWMF.

Notes from session Etherpad:

Wikimedia Technical Conference
Atlanta, GA USA
November 12 - 15, 2019

Session Name / Topic
Self-service Stateless Microservices (for APIs)
Session Leader: Giuseppe; Facilitator: Brooke; Scribe: Brennen, Alex
https://phabricator.wikimedia.org/T234646

Session Attendees
Mate, Nikerabbit, Daniel Zahn, Florian, Bryan Davis, Addshore, Greg
Grossmeier, Alexandros, Riccardo...

Notes:

  • "I have a cool idea - and I don't want to wait 6 months before it sees the light"
  • Trying to understand if there's interest in something - sometimes you have a new idea, and bringing it to production can be pretty frustrating.  It can take at least 6 months for something to see the light.
  • [interlude with encrypted volumes]
  • [slide] Could a production PaaS/FaaS (platform / function as service)
  • Basically want to gather feedback
  • [slide] Not every experiment/tool fits toolforge
  • Toolforge has limitations
    • private data
    • event streams can't connect to the private production kafka etc
  •  [slide] Workflow
    • create a repo
    • write a function in preferred lang
    • add build instructions (hello .pipeline)
    • respond to requests in your preferred language
  • [slide] Wait, that sounds a lot like the pipeline?
  •  [slide] ...but the pipeline has a lot of other steps you have to go through:
    • security review
    • new service request
    • perf / scaling
    • lvs setup
  • Various bottlenecks on various teams
  • curl faas.wiki/myfunction/v1 -X POST
  • Service is available under a namespace on a specific domain
  • Question: Does this fill an actual need?  Are you interested in this?
    • BD: I actually have a use case for it: Stashbot - SAL logging in production - something we rely on, currently runs in toolforge - would be awesome if it ran somewhere where I wasn't the one who had to get out of bed when it's broken
    • TpT: Also have a use case - OCR?
      • Extraction and conversion tools - PDF
    • BD: Addshore?
    • Addshore: I'm sure we have use cases, though I struggle to think of specifics from last few years.
    • GL: Termbox?
    • Addshore: Even maybe a prototype of termbox could have worked?
    • AK: What do I gain by deploying something here and not in toolforge?
    • GL: On one side you're inside production, so you can have some secrets - in toolforge in theory you shouldn't have even an API key inside your image - people do, but you shouldn't.  Also access to production services - Kafka.  If you want to consume events on toolforge, you have to go through EventStream
      • Prototyping - you're not sure it's going to be a full-fledged service.  For things like termbox, you can try it here.
      • Don't want to do them in toolforge because of perf, etc.
    • QUESTION: AK: Second one makes sense, first one doesn't:  Security review
    • GL: Let's say that for some things security review isn't going to be needed... ??? There are going to be restrictions we can talk about later.
    • Moritz Schubotz: Why not improve the workflow we currently have - I would rather make the steps we have more efficient, have a different layer which is time restricted or something like that
    • GL: Some of the things in this workflow (the existing one) can be improved, but some cannot.  At least 3 gatekeeping points.  Mathoid is going to be in production forever (effectively); if we need to take 3 or 4 months to think about that, that's a good thing.  But for experiments...
    • MS: Because security.  If I just install packages from NPM randomly, it's already a security concern...
    • GL: We already do that, but...
    • J ???: Want to mention that it seems useful - ??? - would be a great improvement over the system we have 
      • Question I have is what's the caching story
    • GL: If it's going to be in production, it's absolutely going to be cached.  You can control from your application, but probably cached.
    • ---
    • GG: One of the sessions yesterday was about supporting non-staff / community members in developing - explicit lists of maintainers, a nod towards long-term support/sunsetting if needed
    • GL: We can discuss this with the next question...
    • Mate: This might be going more into implementation details
      • Might make it easier ???
    • P?: We still have a gatekeeping point?  We have a review to decide whether we need a security review?
      • This is designed for temporary solutions?  Either it becomes a permanent service and gets all the review or it's undeployed?
      • How do we not wind up with 100 things running in prod?
  • [Slide] Question: Assuming this is limited to pure lambda, under a different domain than the wikis, not part of the necessary path to serve wikis - will this fit your use case?
    • No authn/authz, no sessions, no cookie access, no PII (seems right)?
    • Get input, provide output
    • No storage
    • Under a different domain than the main project they power (e.g. NOT en.wikimedia.org)
    • Not part of the necessary/critical path to serve wikis (e.g. an addon, a beta feature)
    • It should live somewhere else as soon as it's required to render a page ?
    • Example: termbox - it's in the critical path to serve something to users
    • MS: I'm still not entirely sure if I understand - what's the difference between a special project on toolforge and giving them some data...  What's the difference between this approach and just exposing some data to toolforge via an API
    • GL: I don't think anybody wants to do that?
    • DA: I don't think stashbot would work because of storage?
    • ...
    • GL: You don't need a database.
    • BD: Stashbot's primary store is ElasticSearch - the onwiki stuff is an accident
    • [this seems like an important point, not sure how to summarize]
    • JH: Seems like it'd be useful for prototyping - seems like it would be useful for there to be a clear migration path
    • GL: We could revisit the question of security review - some things could get more review and have more options
    • DA: Do you need to be under NDA to do this?
    • GL: To deploy it probably you need someone...
    • DA: Can someone from the community who's not under NDA write one of these?
    • GL: I'm not even sure that everyone who signed an NDA should be able to do this.
      • Necessary but not sufficient condition.
      • More or less the way we do it with deployment rights - trust is extended to some persons.
      • Not sure what the limitations for toolforge are...  Pretty much everybody?
      • BD: Yep
    • BD: I don't think it would make sense to have something as free as toolforge that's just "you maintain it instead of we maintain it"
    • Brooke: Do you have a sense of what production resources would be available to such a service?
    • GL: No...  A reasonable amount.  Also think it should be able to do some amount of scaling...
    • DA: I don't want to be myopic, but I think this is kind of what we're thinking of with the event stream processing platform in analytics...
      • Quick simple units of logic on top of streams
      • Incompatibility b/w no security review and being able to access private data
      • If you're connecting to private data then you can't render
      • You can prototype in a stream processing platform; when you're ready to render, you go to review
    • GL: Some lambda sitting on top of ???
      • Some new visualization...  Geolocation of articles for example.
    • DA: You know the new MW.org logo?  Gonna light up for edits...
      • Just kidding.
    • Grant: Why can't you be part of the necessary/critical path?
    • GL: Because the moment it's part of the necessary path, you want the architecture to be reviewed so you're not blindsided by something creating strange loops.  Also you want to test the performance of it.  You can't insert a thing that takes 5 seconds to render.  We do these things now.  We didn't used to, and we created a mess.  We don't want to go back to that.
      • If you're not part of the necessary path, you're freer to work.
    • Grant: So you're conflating stateless microservices with self-serve?
    • GL: Yes, it's a combination of both.
    • Grant: Do we have stateless microservices as an option already?
    • Several people: Yes.
    • GL: Rendering ... - new rendering experiments, e.g. a new PS4 rendering option
    • MS: Can you comment on the example about OCR / PDFs
    • GL: You're talking about creating PDFs from wiki content...  I think that would be a good fit, wouldn't work for private wikis.  A perfect use case if you want to do some kind of specific PDF-rendering of wiki content, that's probably one way to go.
    • BD: The thing they're talking about exists in toolforge - ws export ???
    • GL: Given success, should probably graduate to a full service at some point...
  • Question: What should never be on such a system?
    • What's a very bad idea?
    • What aspects didn't I consider?
    • AK: Any form of data store?
    • GL: One example could be stashbot
    • BD: If you're trying to stick with add-on compositing as a core use case...  Your restriction on authn/authz - that implies any PII.
    • GL: So I still get your IP with this, so there are some PIIs involved
    • BD: Do I have to get your IP?  Inside toolforge, a tool doesn't get the user's IP...
    • GL: ... ??? antispam?
    • GL: I'm not sure that without proper security review you can have access to PII - we have to think about this.
    • DA: Kind of hate to say this, but let's say we created like 5 amazing tools on this system that users loved in a month...  That would be bad.  We'd have to productionize those.  Now 5 people made 5 prototypes...
    • Grant:  It's self service.
    • GL: The point that Dan was trying to make is let's say these tools have great success and we want to integrate them...  Then probably we need to think about making them a proper part of production.
    • Addshore:  I think we already have this problem.  Loads of critical stuff is on tools
    • Florian: If they're really successful tools where is the problem with that?
    • DA: Resourcing!
    • Grant: The alternative is you wait for a year to get things implemented
    • Addshore: Having tools as an alternative is great, this is another level...  Things could stay off to the side with more capacity
    • MS: I get the feeling that we're solving the problem that security review just takes too long - wouldn't it solve problems if we just added resources to security review?
    • GL: On one side, security review is just one of the steps, and I'm not sure that it's the longest - and maybe you want it anyway...  Maybe that part is necessary.  But really trying to reduce friction for people to *try things*.  The point is to try things and iterate.  If we have 5 successful experiments, it's a product decision.
      • This is a question for Grant or Toby.
      • I don't think that raising the bar to create something helps in any way with our resourcing problem
    • DA: What about maps?
    • GL: Maps doesn't fit this because it has lots of traffic, lots of storage
    • GL: Graphoid on the other hand could have been born on such a thing - let's say graphoid was created here and went on to become a fundamental tool - we would have to make a public decision.  That's not the way it went.  
    • P: Auth point - If you can't auth, you're not much different from toolforge.
    • GL: Probably some sort of token ???
    • GL: I'm not talking about secrets you can use to access stuff in production
      • You want to access some sort of Google API - you can't do that on toolforge unless you're ok with everybody being able to read it
    • BD: It's root escalations.  There's no protection against root escalations in toolforge.
    • GG: What next steps?
    • GL: I gather feedback here, I expand Phabricator ticket...  There's some interest.  I want people to be able to experiment on production wikis without being stopped by process.
    • GG: My summary thought on this is we had a big session yesterday that touched on a lot of these issues - the social maint / long term support issues.  We should try to separate those from your proposal as much as we can so that you can move ahead with something that would meet needs for some people, while we move forward with the social issues...
    • GL: You're specifically referring to support / lifecycle stuff
    • GG: Yeah.
    • Brooke: I'm also curious if this is something that might turn into a feature request for cloud services, aka functions as a service
    • GL: I would be happy if somebody says great idea, I want to implement that. It could be a possibility for sure.
    • BD: We could enable you by offering prototyping...
    • GL: First we have to find resourcing...
    • JH: A suggestion - consider other kinds of triggers - stream triggers
    • GL: I'm not super sold on the serverless fad in general, but that concept can be molded into something actually useful in general
      • If you look at open function as a service...  It's CGI reinvented on k8s.  It's exactly CGI just distributed across a k8s cluster.
      • We can do something better than that.
    • GG: cgi-bin revolutionized the internet!
  • GL: Back to workflow side - some of these gateways are human.  We can streamline processes, but some of these are technical limitations and that's what I want to solve ???
    • I agree that we need to optimize this part [process] - you can give the security team one person per team - great, but...
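
Editorial illustration (not part of the session notes): the workflow slide and the "pure lambda" restrictions discussed above can be sketched roughly as follows. This is only a sketch under stated assumptions, not a proposed implementation. The `register` decorator, the `dispatch` function, the in-memory `REGISTRY` and the `word_count` example are all hypothetical names invented here; only the `/myfunction/v1` path shape comes from the curl example in the notes. The handler is passed nothing but the parsed request body, so it never sees cookies or session data, and this sketch gives it no storage.

```python
import json
from typing import Callable, Dict, Tuple

# Hypothetical registry mapping (function name, version) to a handler.
Handler = Callable[[dict], dict]
REGISTRY: Dict[Tuple[str, str], Handler] = {}


def register(name: str, version: str) -> Callable[[Handler], Handler]:
    """Expose a handler under /<name>/<version> (illustrative only)."""
    def decorator(func: Handler) -> Handler:
        REGISTRY[(name, version)] = func
        return func
    return decorator


@register("myfunction", "v1")
def word_count(payload: dict) -> dict:
    # Pure transformation: get input, provide output, keep no state.
    return {"words": len(str(payload.get("text", "")).split())}


def dispatch(method: str, path: str, body: bytes) -> Tuple[int, dict]:
    """Route a request to a registered function.

    Request headers are deliberately not part of the signature, so a
    handler cannot read cookies or credentials.
    """
    if method != "POST":
        return 405, {"error": "only POST is supported in this sketch"}
    parts = path.strip("/").split("/")
    if len(parts) != 2 or (parts[0], parts[1]) not in REGISTRY:
        return 404, {"error": "unknown function"}
    try:
        payload = json.loads(body or b"{}")
    except ValueError:  # covers invalid JSON and invalid UTF-8
        return 400, {"error": "body must be JSON"}
    if not isinstance(payload, dict):
        return 400, {"error": "body must be a JSON object"}
    return 200, REGISTRY[(parts[0], parts[1])](payload)
```

Under these assumptions, a call such as `dispatch("POST", "/myfunction/v1", b'{"text": "hello stateless world"}')` plays the role of the curl request from the notes. A real system would still need the review, caching and lifecycle decisions raised in the discussion.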
greg closed this task as Resolved. Dec 17 2019, 10:55 PM

Thanks for making this a good session at TechConf this year. Follow-up actions are recorded in a central planning spreadsheet (owned by me) and I'll begin farming them out to responsible parties in January 2020.