I don't find the wording in the RFC to be that ambiguous actually.
> The answer to the query, possibly preface by one or more CNAME RRs that specify aliases encountered on the way to an answer.
The "possibly preface" (sic!) to me is obviously to be understood as "if there are any CNAME RRs, the answer to the query is to be prefaced by those CNAME RRs" and not "you can preface the query with the CNAME RRs or you can place them wherever you want".
I agree this doens't seem too ambiguous - it's "you may do this.." and they said "or we may do the reverse". If I say you're could prefix something.. the alternative isn't that you can suffix it.
But also.. the programmers working on the software running one of the most important (end-user) DNS servers in the world:
1. Changes logic in how CNAME responses are formed
2. I assume some tests at least broke that meant they needed to be "fixed up" (y'know - "when a CNAME is queried, I expect this response")
3. No one saw these changes in test behavoir and thought "I wonder if this order is important". Or "We should research more into this", Or "Are other DNS servers changing order", Or "This should be flagged for a very gradual release".
4. Ends up in test environment for, what, a month.. nothing using getaddrinfo from glibc is being used to test this environment or anyone noticed that it was broken
Cloudflare seem to be getting into thr swing of breaking things and then being transparent. But this really reads as a fun "did you know", not a "we broke things again - please still use us".
There's no real RCA except to blame an RFC - but honestly, for a large-scale operation like there's this seems very big to slip through the cracks.
I would make a joke about South Park's oil "I'm sorry".. but they don't even seem to be
The article makes it very clear that the ambiguity arises in another phrase: “difference in ordering of the RRs in the answer section is not significant”, which is applied to an example; the problem with examples being that they are illustrative, viz. generalisable, and thus may permit order ambiguity everywhere, and in any case, whether they should or shouldn’t becomes a matter of pragmatic context.
Which goes to show, one person’s “obvious understanding” is another’s “did they even read the entire document”.
All of which also serves to highlight the value of normative language, but that came later.
I agree with you, and I also think that their interpretation of example 6.2.1 in the RFC is somewhat nonsensical. It states that “The difference in ordering of the RRs in the answer section is not significant.” But from the RFC, very clearly this comment is relevant only to that particular example; it is comparing two responses and saying that in this case, the different ordering has no semantic effect.
And perhaps this is somewhat pedantic, but they also write that “RFC 1034 section 3.6 defines Resource Record Sets (RRsets) as collections of records with the same name, type, and class.” But looking at the RFC, it never defines such a term; it does say that within a “set” of RRs “associated with a particular name” the order doesn’t matter. But even if the RFC had said “associated with a particular combination of name, type, and class”, I don’t see how that could have introduced ambiguity. It specifies an exception to a general rule, so obviously if the exception doesn’t apply, then the general rule must be followed.
Anyway, Cloudflare probably know their DNS better than I do, but I did not find the article especially persuasive; I think the ambiguity is actually just a misreading, and that the RFC does require a particular ordering of CNAME records.
(ETA:) Although admittedly, while the RFC does say that CNAMEs must come before As in the answer, I don’t necessarily see any clear rule about how CNAME chains must be ordered; the RFC just says “Domain names in RRs which point at another name should always point at the primary name and not the alias ... Of course, by the robustness principle, domain software should not fail when presented with CNAME chains or loops; CNAME chains should be followed”. So actually I guess I do agree that there is some ambiguity about the responses containing CNAME chains.
Even if 'possibly preface' is interpreted to mean CNAME RRSets should appear first there is still a broken reliance by some resolvers on the order of CNAME RRsets if there is more than one CNAME in the chain. This expectation of ordering is not promised by the relevant RFCs.
Isn't this literally noted in the article? The article even points out that the RFC is from before normative words were standardized for hard requirements.
The context makes it less clear, but even if we pretend that part is crystal, a comment that stops there is missing the point of the article. All CNAMEs at the start isn't enough. The order of the CNAMEs can cause problems despite perfect RFC compliance.
"With a sufficient number of users of an API,
it does not matter what you promise in the contract:
all observable behaviors of your system
will be depended on by somebody."
combined with failure to follow Postel's Law:
"Be conservative in what you send, be liberal in what you accept."
What's reasonable is: "Set reserved fields to 0 when writing and ignore them when reading." (I heard that was the original example). Or "Ignore unknown JSON keys" as a modern equivalent.
What's harmful is: Accept an ill defined superset of the valid syntax and interpret it in undocumented ways.
Funny I never read the original example. And in my book, it is harmful, and even worse in JSON, since it's the best way to have a typo somewhere go unnoticed for a long time.
I disagree. I find accepting extra random bytes in places to be just as harmful. I prefer APIs that push back and tell me what I did wrong when I mess up.
Good modern protocols will explicitly define extension points, so 'ingoring unknown JSON keys' is in-spec rather than assumed that an implementer will do.
Very much so. A better law would be conservative in both sending and accepting, as it turns out that if you are liberal in what you accept, senders will choose to disobey Postel's law and be liberal in what they send, too.
"Warnings" are like the most difficult thing to 'send' though. If an app or service doesn't outright fail, warnings can be ignored. Even if not ignored... how do you properly inform? A compiler can spit out warnings to your terminal, sure. Test-runners can log warnings. An RPC service? There's no standard I'm aware of. And DNS! Probably even worse. "Yeah, your RRs are out of order but I sorted them for you." where would you put that?
That sounds awful and will send administrators on a wild goose chase throughout their stack to find the issue without many clues except this thing is failing at seemingly random times. (I myself would suspect something related to network connectivity, maybe requests are timing out? This idea would lead me in the completely wrong direction.)
It also does not give any way to actually see a warning message, where would we even put it? I know for a fact that if my glibc DNS resolver started spitting out errors into /var/log/god_knows_what I would take days to find it, at best the resolver could return some kind of errno with perror giving us a message like "The DNS response has not been correctly formatted", and then hope that the message is caught and forwarded through whatever is wrapping the C library, hopefully into our stderr. And there's so many ways even that could fail.
The Python 3 community was famously divided on that matter, wrt Python 3. Now that it is over, most people on the "accept liberally" side of the fence have jumped sides.
That's true, but sort of misses the spirit of Hyrum's law (which is that the world is filled with obscure edge cases).
In this case the broken resolver was the one in the GNU C Library, hardly an obscure situation!
The news here is sort of buried in the story. Basically Cloudflare just didn't test this. Literally every datacenter in the world was going to fail on this change, probably including their own.
It's remarkable that the ordinary DNS lookup function in glibc doesn't work if the records aren't in the right order. It's amazing to me we went 20+ years without that causing more problems. My guess is most people publishing DNS records just sort of knew that the order mattered in practice, maybe figuring it out in early testing.
I think it's more of a server side ordering, in which there were not that many DNS servers out there, and the ones that didn't keep it in order quickly changed the behavior because of interop.
> While in our interpretation the RFCs do not require CNAMEs to appear in any particular order, it’s clear that at least some widely-deployed DNS clients rely on it. As some systems using these clients might be updated infrequently, or never updated at all, we believe it’s best to require CNAME records to appear in-order before any other records.
Now that I have seemingly taken on managing DNS at my current company I have seen several inadequacies of DNS that I was not aware of before. Main one being that if an upstream DNS server returns SERVFAIL, there is no distinction really between if the server you are querying is failed, or the actual authoritative server upstream is broken (I am aware of EDEs but doesn't really solve this). So clients querying a broken domain will retry each of their configured DNS servers, and our caching layer (Unbound) will also retry each of their upstreams etc... Results in a bunch of pointless upstream queries like an amplification attack. Also have issue with the search path doing stupid queries with NXDOMAIN like badname.company.com, badname.company.othername.com... etc..
I would expect, that dns servers like 1.1.1.1 at this scale have integration tests running real resolvers, like the one in glibc. How come this issue was discovered only in production?
Many rightfully interpret the RFC as "CNAME have to be before A", but the issue persists inbetween CNAMEs in the chain as noted in the article. If a record in the middle of the chain expires, glibc would still fail if the "middle" record was to be inserted between CNAMEs and A records.
Doesn't the precipitating change optimize memory on the DNS server at the expense of additional memory usage across millions of clients that now need to parse an unordered response?
I don't know the official process, but as a human that sometimes reads and implements IETF RFCs, I'd appreciate updates to the original doc rather than replacing it with something brand new. Probably with some dated version history.
Otherwise I might go to consult my favorite RFC and not even know its been superseded. And if it has been superseded with a brand new doc, now I have to start from scratch again instead of reading the diff or patch notes to figure out what needs updating.
And if we must supersede, I humbly request a warning be put at the top, linking the new standard.
> RFC 1034, published in 1987, defines much of the behavior of the DNS protocol, and should give us an answer on whether the order of CNAME records matters. Section 4.3.1 contains the following text:
> If recursive service is requested and available, the recursive response to a query will be one of the following:
> - The answer to the query, possibly preface by one or more CNAME RRs that specify aliases encountered on the way to an answer.
> While "possibly preface" can be interpreted as a requirement for CNAME records to appear before everything else, it does not use normative key words, such as MUST and SHOULD that modern RFCs use to express requirements. This isn’t a flaw in RFC 1034, but simply a result of its age. RFC 2119, which standardized these key words, was published in 1997, 10 years after RFC 1034.
It's pretty clear that CNAME is at the beginning.
The "possibly" does not refer to the order but rather to the presence.
> However, we did not have any tests asserting the behavior remains consistent due to the ambiguous language in the RFC.
Maybe I'm being overly-cynical but I have a hard time believing that they deliberately omitted a test specifically because they reviewed the RFC and found the ambiguous language. I would've expected to see some dialog with IETF beforehand if that were the case. Or some review of the behavior of common DNS clients.
It seems like an oversight, and that's totally fine.
I took it as being "we wrote the tests to the standard" and then built the code, and whoever was writing the tests didn't read that line as a testable aspect.
It's implied that they intentionally tested it that way, without any assertions on the order. Not by oversight of incompetence, but because they didn't want to bake the requirement in due to uncertainty.
Given my years of experience with Cisco "quality", I'm not surprised by this:
> Another notable affected implementation was the DNSC process in three models of Cisco ethernet switches. In the case where switches had been configured to use 1.1.1.1 these switches experienced spontaneous reboot loops when they received a response containing the reordered CNAMEs.
... but I am surprised by this:
> One such implementation that broke is the getaddrinfo function in glibc, which is commonly used on Linux for DNS resolution.
Not that glibc did anything wrong -- I'm just surprised that anyone is implementing an internet-scale caching resolver without a comprehensive test suite that includes one of the most common client implementations on the planet.
After the release got reverted, it took an 1hr28min for the deployment to propagate. You'd think that would be a very long time for CloudFlare infrastructure.
We should probably all be glad that CloudFlare doesn't have the ability to update its entire global fleet any faster than 1h 28m, even if it’s a rollback operation.
Any change to a global service like that, even a rollback (or data deployment or config change), should be released to a subset of the fleet first, monitored, and then rolled out progressively.
Nice analysis. Boy I can’t imagine having to work at Cloudflare on this stuff. A month to get your “small in code” change out only to find some bums somewhere have written code that will make it not work.
Or when working on massive infrastructure like this, you write plenty of tests that would have saved you a month worth of work.
They write reordering, push it and glibc tester fires, fails and you quickly discover "Crap, tests are failing and dependency (glibc) doesn't work way I thought it would."
> The answer to the query, possibly preface by one or more CNAME RRs that specify aliases encountered on the way to an answer.
The "possibly preface" (sic!) to me is obviously to be understood as "if there are any CNAME RRs, the answer to the query is to be prefaced by those CNAME RRs" and not "you can preface the query with the CNAME RRs or you can place them wherever you want".
But also.. the programmers working on the software running one of the most important (end-user) DNS servers in the world:
1. Changes logic in how CNAME responses are formed
2. I assume some tests at least broke that meant they needed to be "fixed up" (y'know - "when a CNAME is queried, I expect this response")
3. No one saw these changes in test behavoir and thought "I wonder if this order is important". Or "We should research more into this", Or "Are other DNS servers changing order", Or "This should be flagged for a very gradual release".
4. Ends up in test environment for, what, a month.. nothing using getaddrinfo from glibc is being used to test this environment or anyone noticed that it was broken
Cloudflare seem to be getting into thr swing of breaking things and then being transparent. But this really reads as a fun "did you know", not a "we broke things again - please still use us".
There's no real RCA except to blame an RFC - but honestly, for a large-scale operation like there's this seems very big to slip through the cracks.
I would make a joke about South Park's oil "I'm sorry".. but they don't even seem to be
Which goes to show, one person’s “obvious understanding” is another’s “did they even read the entire document”.
All of which also serves to highlight the value of normative language, but that came later.
And perhaps this is somewhat pedantic, but they also write that “RFC 1034 section 3.6 defines Resource Record Sets (RRsets) as collections of records with the same name, type, and class.” But looking at the RFC, it never defines such a term; it does say that within a “set” of RRs “associated with a particular name” the order doesn’t matter. But even if the RFC had said “associated with a particular combination of name, type, and class”, I don’t see how that could have introduced ambiguity. It specifies an exception to a general rule, so obviously if the exception doesn’t apply, then the general rule must be followed.
Anyway, Cloudflare probably know their DNS better than I do, but I did not find the article especially persuasive; I think the ambiguity is actually just a misreading, and that the RFC does require a particular ordering of CNAME records.
(ETA:) Although admittedly, while the RFC does say that CNAMEs must come before As in the answer, I don’t necessarily see any clear rule about how CNAME chains must be ordered; the RFC just says “Domain names in RRs which point at another name should always point at the primary name and not the alias ... Of course, by the robustness principle, domain software should not fail when presented with CNAME chains or loops; CNAME chains should be followed”. So actually I guess I do agree that there is some ambiguity about the responses containing CNAME chains.
I just commented the same.
It's pretty clear that the "possibly" refers to the presence of the CNAME RRs, not the ordering.
"With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody."
combined with failure to follow Postel's Law:
"Be conservative in what you send, be liberal in what you accept."
What's reasonable is: "Set reserved fields to 0 when writing and ignore them when reading." (I heard that was the original example). Or "Ignore unknown JSON keys" as a modern equivalent.
What's harmful is: Accept an ill defined superset of the valid syntax and interpret it in undocumented ways.
It also does not give any way to actually see a warning message, where would we even put it? I know for a fact that if my glibc DNS resolver started spitting out errors into /var/log/god_knows_what I would take days to find it, at best the resolver could return some kind of errno with perror giving us a message like "The DNS response has not been correctly formatted", and then hope that the message is caught and forwarded through whatever is wrapping the C library, hopefully into our stderr. And there's so many ways even that could fail.
Through the appropriate channels; in-band and out-of-band.
In this case the broken resolver was the one in the GNU C Library, hardly an obscure situation!
The news here is sort of buried in the story. Basically Cloudflare just didn't test this. Literally every datacenter in the world was going to fail on this change, probably including their own.
CNAMES are a huge pain in the ass (as noted by DJB https://cr.yp.to/djbdns/notes.html)
That's the only reasonable conclusion, really.
Reminds me of https://news.ycombinator.com/item?id=37962674 or see https://tech.tiq.cc/2016/01/why-you-shouldnt-use-cloudflare/
It’s always DNS.
Also no, the client doesn't need more memory to parse the out-of-order response, it can take multiple passes through the kilobyte.
Also, what's the right mental framework behind deciding when to release a patch RFC vs obsoleting the old standard for a comprehensive update?
Otherwise I might go to consult my favorite RFC and not even know its been superseded. And if it has been superseded with a brand new doc, now I have to start from scratch again instead of reading the diff or patch notes to figure out what needs updating.
And if we must supersede, I humbly request a warning be put at the top, linking the new standard.
https://datatracker.ietf.org/doc/html/rfc5245
I agree, that it would be much more helpful if made obvious in the document itself.
It's not obvious that "updated by" notices are treated in any more of a helpful manner than "obsoletes"
> If recursive service is requested and available, the recursive response to a query will be one of the following:
> - The answer to the query, possibly preface by one or more CNAME RRs that specify aliases encountered on the way to an answer.
> While "possibly preface" can be interpreted as a requirement for CNAME records to appear before everything else, it does not use normative key words, such as MUST and SHOULD that modern RFCs use to express requirements. This isn’t a flaw in RFC 1034, but simply a result of its age. RFC 2119, which standardized these key words, was published in 1997, 10 years after RFC 1034.
It's pretty clear that CNAME is at the beginning.
The "possibly" does not refer to the order but rather to the presence.
If they are present, they are are first.
Maybe I'm being overly-cynical but I have a hard time believing that they deliberately omitted a test specifically because they reviewed the RFC and found the ambiguous language. I would've expected to see some dialog with IETF beforehand if that were the case. Or some review of the behavior of common DNS clients.
It seems like an oversight, and that's totally fine.
> Another notable affected implementation was the DNSC process in three models of Cisco ethernet switches. In the case where switches had been configured to use 1.1.1.1 these switches experienced spontaneous reboot loops when they received a response containing the reordered CNAMEs.
... but I am surprised by this:
> One such implementation that broke is the getaddrinfo function in glibc, which is commonly used on Linux for DNS resolution.
Not that glibc did anything wrong -- I'm just surprised that anyone is implementing an internet-scale caching resolver without a comprehensive test suite that includes one of the most common client implementations on the planet.
Any change to a global service like that, even a rollback (or data deployment or config change), should be released to a subset of the fleet first, monitored, and then rolled out progressively.
They write reordering, push it and glibc tester fires, fails and you quickly discover "Crap, tests are failing and dependency (glibc) doesn't work way I thought it would."
It's surprising how something so simple can be so broken.