Page MenuHomePhabricator

Allow special characters in domains
Closed, ResolvedPublic

Description

então.carolinadoran.com broke a lot of stuff
We need to make sure this doesn't happen again, so fix the domain and allow this case in the future. Also fix the Wiki we currently blocked.

AC:

  • We defined which special characters are actually acceptable in general standards
  • MediaWiki allows users to use a domain, .wikibase.cloud or custom, containing special characters as defined in the first AC

PRs:

Event Timeline

Evelien_WMDE renamed this task from Enhance validation on domain name field upon Wiki creation to Allow special characters in domains.Aug 29 2023, 9:50 AM
Evelien_WMDE updated the task description. (Show Details)

Would this be easy to implement soon? If not maybe (at least) to have redirects as fallback?

FYI I also saw this while digging for other ingress issues. Look like k8s ingress implementation might not be all that happy with special chars either:

{
  "protoPayload": {
    "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
    "authenticationInfo": {
      "principalEmail": "system:serviceaccount:default:api-defaultrole"
    },
    "authorizationInfo": [
      {
        "granted": true,
        "permission": "io.k8s.networking.v1.ingresses.create",
        "resource": "networking.k8s.io/v1/namespaces/default/ingresses/mediawiki-site-584"
      }
    ],
    "methodName": "io.k8s.networking.v1.ingresses.create",
    "request": {
      "@type": "networking.k8s.io/v1.Ingress",
      "apiVersion": "networking.k8s.io/v1",
      "kind": "Ingress",
      "metadata": {
        "annotations": {
          "cert-manager.io/cluster-issuer": "letsencrypt-prod",
          "kubernetes.io/ingress.class": "nginx",
          "nginx.ingress.kubernetes.io/force-ssl-redirect": "true"
        },
        "creationTimestamp": null,
        "labels": {
          "app.kubernetes.io/managed-by": "wbstack-platform",
          "wbstack-ingress-generation": "2020-04-18.1",
          "wbstack-wiki-domain": "então.carolinadoran.com",
          "wbstack-wiki-id": "584"
        },
        "name": "mediawiki-site-584",
        "namespace": "default"
      },
      "spec": {
        "rules": [
          {
            "host": "então.carolinadoran.com",
            "http": {
              "paths": [
                {
                  "backend": {
                    "service": {
                      "name": "platform-nginx",
                      "port": {
                        "number": 8080
                      }
                    }
                  },
                  "path": "/",
                  "pathType": "Prefix"
                }
              ]
            }
          }
        ],
        "tls": [
          {
            "hosts": [
              "então.carolinadoran.com"
            ],
            "secretName": "mediawiki-site-tls-584"
          }
        ]
      },
      "status": {
        "loadBalancer": {}
      }
    },
    "requestMetadata": {
      "callerIp": "192.168.0.2",
      "callerSuppliedUserAgent": "GuzzleHttp/6.5.5 curl/7.74.0 PHP/7.4.33"
    },
    "resourceName": "networking.k8s.io/v1/namespaces/default/ingresses/mediawiki-site-584",
    "response": {
      "@type": "core.k8s.io/v1.Status",
      "apiVersion": "v1",
      "code": 422,
      "details": {
        "causes": [
          {
            "field": "metadata.labels",
            "message": "Invalid value: \"então.carolinadoran.com\": a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue',  or 'my_value',  or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?')",
            "reason": "FieldValueInvalid"
          },
          {
            "field": "spec.rules[0].host",
            "message": "Invalid value: \"então.carolinadoran.com\": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')",
            "reason": "FieldValueInvalid"
          },
          {
            "field": "spec.tls[0].hosts[0]",
            "message": "Invalid value: \"então.carolinadoran.com\": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')",
            "reason": "FieldValueInvalid"
          }
        ],
        "group": "extensions",
        "kind": "Ingress",
        "name": "mediawiki-site-584"
      },
      "kind": "Status",
      "message": "Ingress.extensions \"mediawiki-site-584\" is invalid: [metadata.labels: Invalid value: \"então.carolinadoran.com\": a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue',  or 'my_value',  or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?'), spec.rules[0].host: Invalid value: \"então.carolinadoran.com\": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*'), spec.tls[0].hosts[0]: Invalid value: \"então.carolinadoran.com\": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')]",
      "metadata": {},
      "reason": "Invalid",
      "status": "Failure"
    },
    "serviceName": "k8s.io",
    "status": {
      "code": 9,
      "message": "Ingress.extensions \"mediawiki-site-584\" is invalid: [metadata.labels: Invalid value: \"então.carolinadoran.com\": a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue',  or 'my_value',  or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?'), spec.rules[0].host: Invalid value: \"então.carolinadoran.com\": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*'), spec.tls[0].hosts[0]: Invalid value: \"então.carolinadoran.com\": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')]"
    }
  },
  "insertId": "deeb37f6-5ce1-4d2c-a6e9-a67a62e541a6",
  "resource": {
    "type": "k8s_cluster",
    "labels": {
      "project_id": "wikibase-cloud",
      "cluster_name": "wbaas-3",
      "location": "europe-west3-a"
    }
  },
  "timestamp": "2023-08-22T22:18:41.247958Z",
  "labels": {
    "authorization.k8s.io/reason": "RBAC: allowed by ClusterRoleBinding \"api-defaultrole\" of ClusterRole \"api-defaultrole\" to ServiceAccount \"api-defaultrole/default\"",
    "authorization.k8s.io/decision": "allow"
  },
  "logName": "projects/wikibase-cloud/logs/cloudaudit.googleapis.com%2Factivity",
  "operation": {
    "id": "deeb37f6-5ce1-4d2c-a6e9-a67a62e541a6",
    "producer": "k8s.io",
    "first": true,
    "last": true
  },
  "receiveTimestamp": "2023-08-22T22:18:53.304837045Z"
}

I was able to reproduce the error locally by dumping the k8s resource schema of the ingress (by calling getSchema() on this Ingress object https://github.com/wbstack/api/blob/main/app/Jobs/KubernetesIngressCreate.php#L44)

The Ingress "mediawiki-site-11" is invalid: 
* metadata.labels: Invalid value: "então7.carolinadoran.com": a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue',  or 'my_value',  or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?')
* spec.rules[0].host: Invalid value: "então7.carolinadoran.com": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')
* spec.tls[0].hosts[0]: Invalid value: "então7.carolinadoran.com": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')

Interestingly enough, only after I altered the schema to contain the special character. The output of getSchema already translated it into it's punycode representation: entu00e3o7.carolinadoran.com (note: the 7 in this example is not important, it was just the 7th attempt on my local testing cluster) (edit: this was not punycode, the special character just got escaped)

I was able to just apply the unaltered schema with kubectl, and it did in fact create an ingress:

{
    "kind": "Ingress",
    "apiVersion": "networking.k8s.io/v1",
    "metadata": {
        "name": "mediawiki-site-11",
        "namespace": "default",
        "labels": {
            "wbstack-wiki-id": "11",
            "wbstack-wiki-domain": "entu00e3o7.carolinadoran.com",
            "wbstack-ingress-generation": "2020-04-18.1",
            "app.kubernetes.io/managed-by": "wbstack-platform"
        },
        "annotations": {
            "kubernetes.io/ingress.class": "nginx",
            "nginx.ingress.kubernetes.io/force-ssl-redirect": "true",
            "cert-manager.io/cluster-issuer": "letsencrypt-prod"
        }
    },
    "spec": {
        "tls": [
            {
                "hosts": [
                    "entu00e3o7.carolinadoran.com"
                ],
                "secretName": "mediawiki-site-tls-11"
            }
        ],
        "rules": [
            {
                "host": "entu00e3o7.carolinadoran.com",
                "http": {
                    "paths": [
                        {
                            "path": "/",
                            "pathType": "Prefix",
                            "backend": {
                                "service": {
                                    "name": "platform-nginx",
                                    "port": {
                                        "number": 8080
                                    }
                                }
                            }
                        }
                    ]
                }
            }
        ]
    }
}

Which leads me to the question, why does it fail to apply via the php library?

But even *if* it would apply correctly, I think we still need to add some transcoding logic *somewhere* (probably more than one place) in our system, since I think the ingress I created with kubectl literally works for entu00e3o7.carolinadoran.com instead of então7.carolinadoran.com.

I put up a PR which stores domains in punycode. I'm not sure that is the right approach as we have to decode it in places where we want to (API responses for users, as of now only done for the /details endpoint). https://github.com/wbstack/api/pull/652

As of now this seems fine, but I'm not so sure about it. For example, the data for the discovery feature still gets the encoded domain, which is fine, because the domain is not dispalyed currently and browsers handle it correctly.

I see three options:

  1. use it as it is
  2. leave the decoding completely to the API consumers (= UI)
  3. only encode to punycode for ingress creation, never store it, and decode it everywhere at mediawiki side where necessary

I'm interested if option 3 would work easily, but I could also see how handling domains in punycode only could prevent other encoding issues that may be lurking somewhere else.

edit: I tried it, and mediawiki needs to decode it at different stages plus possibly other components as well (QueryService? ES?)

So while it seems a bit hacky I would still suggest going forward with the current approach. I thought about adding some transformation method to the WikiDomain model but then again the Wiki model isn't even using the WikiDomain model.

Deniz_WMDE moved this task from Doing to In Review on the Wikibase Cloud (Kanban board Q3 2023) board.
Deniz_WMDE subscribed.

One place where this is notably "ugly" but I think is also perfectly acceptable is the Queryservice UI:

image.png (892×956 px, 71 KB)

The domain name punycode encoded at the top looks rather rough but the link still works fine.

The actual URIs in the triples also look unpleasant but I looked it up and they should indeed do this to follow the spec: https://www.w3.org/TR/rdf11-concepts/#h3_section-IRIs where they should be normalised to this punycode encoded form

I updated the ADR and the implementation to make use of the IDN format instead of punycode. While IDN is using punycode, it and it's implementation is not the same, as it is specific for domains. This also allowed for refraining from using a 3rd party package as there are native PHP functions for that: https://www.php.net/manual/en/ref.intl.idn.php. I implemented a wrapper with some test cases so we can be sure we use it in a way that we expect it to work.

I implemented the unicode conversion in the UI here: https://github.com/wbstack/ui/pull/742

But I have to say that it doesn't feel right for us needing to alter the API data everywhere where domains appear. On the other hand, this would be needed then on the API side. I wonder if we can find some middle ground, maybe a new field in the API response, or a generic conversion if the value is casted to String (Stringable interface).

edit: I just learned about Eloquent accessors, this looks like the right thing to use: https://laravel.com/docs/8.x/eloquent-serialization#appending-values-to-json

We just would have to be careful to use it when not the whole model data is used by default, here for instance https://github.com/wbstack/api/pull/652/files#diff-693a2896e94f046cb59f53fa43a9a46e482734ca7c4facc8540f8e0cf35bc438

Must not forget to also ship UI change.

We also want to notify the user that they can now try again with this non-ascii domain.

Evelien_WMDE claimed this task.