MokaFive @mokafive-blog - Tumblr Blog


A team of MokaFivers won the /dev/null hackathon at EA this past weekend. Read the harrowing tale written by Jasson McMorris, MokaFive engineer. "My hands flew across the keyboard like a hummingbird flapping its wings; ..."

#mokafive #hackathon #/dev/null #victory #hacking

OpenSSL 1.0 + SNI + curl = interoperability #fail

I was messing around with a new install of Apache 2.4.3 with the new OpenSSL library and ran into an interesting problem. MokaFive Player would fail to connect to my Apache server, but I could connect fine from my web browser. libcurl was returning CURLE_SSL_CONNECT_ERROR (which had this mildly entertaining and uninformative message: "wrong when connecting with SSL"). The SSL certs and ciphers were all set up correctly, so what gives?

I turned on cURL verbose logging to see what was the matter, and saw this error line:

error:14077458:SSL routines:SSL23_GET_SERVER_HELLO:reason(1112)

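What's that? OpenSSL reports a TLS alert received from the peer as a reason code of 1000 plus the alert's description code, so "1112" means alert 112 (Unrecognized name). In this case the server sends it at Level 1 (Warning), not as a fatal alert.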
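If you want to pick one of these error lines apart yourself, here is a minimal sketch. It assumes the packed error layout used by OpenSSL releases before 3.0 (8 bits of library, 12 bits of function, 12 bits of reason) and the 1000 offset OpenSSL adds to alert codes. The function names are made up for illustration.

```python
# Offset OpenSSL adds to a peer alert's description code to form a reason code.
SSL_AD_REASON_OFFSET = 1000


def decode_openssl_error(hex_code: str) -> dict:
    """Split a packed pre-3.0 OpenSSL error code into library, function, reason."""
    code = int(hex_code, 16)
    return {
        "lib": (code >> 24) & 0xFF,      # 20 is the SSL library
        "func": (code >> 12) & 0xFFF,    # function that raised the error
        "reason": code & 0xFFF,          # reason code
    }


def alert_from_reason(reason: int):
    """Return the TLS alert description code if the reason encodes a peer alert."""
    if SSL_AD_REASON_OFFSET <= reason < SSL_AD_REASON_OFFSET + 256:
        return reason - SSL_AD_REASON_OFFSET
    return None


fields = decode_openssl_error("14077458")
print(fields, alert_from_reason(fields["reason"]))
# prints {'lib': 20, 'func': 119, 'reason': 1112} 112
```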

This gets into the SSL "SNI" (Server Name Indication) feature, which is a nifty recent extension to SSL where the client sends the name of the host it is trying to connect to at the start of the SSL handshake process. This is really useful because it allows you to run different virtual hosts on the same IP address and port. At connection time, the server can look at what host the client is connecting to and present the correct certificate. Even cooler, proxies and load balancers can look at the SNI name and just forward the raw socket traffic to the appropriate server. The proxy/load balancer doesn't need to terminate the SSL connection - the SSL connection can terminate at the real server, which is much nicer.

Before SNI, people would do this with HTTP "Host" headers. But SNI is much better because you don't need to decode HTTP headers (which requires terminating the SSL connection to read and parse the headers). And of course without SNI the server can't present the right certificate during SSL negotiation because it doesn't know which certificate you want until after you establish the connection and read the Host header.
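To make the "proxies can look at the SNI name" point concrete, here is a rough sketch that generates a real ClientHello in memory with Python's ssl module (no network needed) and then pulls the host name out of the raw bytes, the way a pass-through proxy might. The helper names are hypothetical. The parser assumes the whole ClientHello arrives in a single TLS record and does no bounds checking, so treat it as an illustration rather than production code.

```python
import ssl


def client_hello_bytes(hostname: str) -> bytes:
    """Return the first bytes a TLS client would send when connecting to hostname."""
    ctx = ssl.create_default_context()
    incoming, outgoing = ssl.MemoryBIO(), ssl.MemoryBIO()
    tls = ctx.wrap_bio(incoming, outgoing, server_hostname=hostname)
    try:
        tls.do_handshake()
    except ssl.SSLWantReadError:
        pass  # expected: the client is now waiting for the server's reply
    return outgoing.read()


def sni_from_client_hello(data: bytes):
    """Extract the SNI host name from a raw ClientHello record, or return None."""
    if len(data) < 6 or data[0] != 0x16 or data[5] != 0x01:
        return None                      # not a handshake record / not a ClientHello
    pos = 5 + 4                          # record header + handshake header
    pos += 2 + 32                        # client version + random
    pos += 1 + data[pos]                 # session id
    pos += 2 + int.from_bytes(data[pos:pos + 2], "big")   # cipher suites
    pos += 1 + data[pos]                 # compression methods
    ext_end = pos + 2 + int.from_bytes(data[pos:pos + 2], "big")
    pos += 2
    while pos + 4 <= ext_end:
        ext_type = int.from_bytes(data[pos:pos + 2], "big")
        ext_len = int.from_bytes(data[pos + 2:pos + 4], "big")
        body = data[pos + 4:pos + 4 + ext_len]
        if ext_type == 0:                # server_name extension
            # body: list length (2), name type (1), name length (2), name
            name_len = int.from_bytes(body[3:5], "big")
            return body[5:5 + name_len].decode("ascii")
        pos += 4 + ext_len
    return None


print(sni_from_client_hello(client_hello_bytes("example.com")))
```

Note that nothing here decrypts anything: the host name travels in the clear in the very first message, which is why a proxy can route on it without terminating the connection.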

Anyway, why does this fail with cURL and not with any browser? It turns out to be a strange interaction between different OpenSSL versions, cURL, and Apache. Basically, when you are using cURL compiled with OpenSSL >=0.9.8g but <1.0 and are connecting to a server with OpenSSL >=1.0, *and* the hostname sent via SNI does not match what the server expects, the server will return an advisory warning "unrecognized name", but the server will continue to process the request anyway using the default virtual host definition. Every client on the planet except cURL will ignore the advisory warning and keep going. But cURL will fail the connection.
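The difference between "ignore the warning" and "fail the connection" comes down to how a client treats the alert's level byte. Here is a small sketch of the tolerant policy, assuming a plaintext TLS alert record (record type 0x15, then a 2-byte version, a 2-byte length, the level and the description). The function names are invented for this example and are not taken from cURL or OpenSSL.

```python
ALERT_RECORD = 0x15        # TLS record type for alerts
WARNING, FATAL = 1, 2      # alert levels
CLOSE_NOTIFY = 0           # warning-level alert that still ends the connection
UNRECOGNIZED_NAME = 112    # the alert behind reason 1112


def parse_alert(record: bytes):
    """Return (level, description) for a plaintext TLS alert record, else None."""
    if len(record) < 7 or record[0] != ALERT_RECORD:
        return None
    if int.from_bytes(record[3:5], "big") != 2:
        return None                      # an alert body is exactly two bytes
    return record[5], record[6]


def should_continue(record: bytes) -> bool:
    """Tolerant policy: keep going after a warning, stop on fatal or close_notify."""
    alert = parse_alert(record)
    if alert is None:
        return True                      # not an alert, nothing to decide
    level, description = alert
    return level == WARNING and description != CLOSE_NOTIFY


# The alert the server sends in this scenario: warning level, unrecognized_name.
sni_warning = bytes([0x15, 0x03, 0x01, 0x00, 0x02, WARNING, UNRECOGNIZED_NAME])
print(parse_alert(sni_warning), should_continue(sni_warning))
# prints (1, 112) True
```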

There are a few workarounds, however. If you tell cURL explicitly what SSL version to use when connecting, it won't send the SNI extension so the connection will work fine. Also, if you change your Apache server configuration by adding a ServerName/ServerAlias that matches what the client sends, the server will no longer send the 1112 warning and you won't trigger the bug. This second workaround is the one I implemented.

So whose fault is it? One could argue this is an OpenSSL bug because they changed the protocol in an incompatible way between different OpenSSL versions, so old clients will receive a new warning code when connecting to newer servers. But to me, the fault clearly lies with cURL. Every other client ignores the SNI warning. cURL is the only one that fails. The cURL author knows about the problem but does not seem interested in fixing it, insisting this is a bug in OpenSSL or Apache.

Who loses in this finger-pointing? People who use cURL and want to connect to web servers running newer versions of OpenSSL. So it really makes no sense.

SSL is one of those fundamental things you assume "just works", and you shouldn't have to worry about version mismatch or interoperability. It is designed to be backward compatible and negotiate ciphers, protocol versions, certificates, etc. Unfortunately that is not the case. So a big ol' #fail to OpenSSL and cURL for screwing up SSL interoperability!

#SSL #OpenSSL #cURL #libcurl #fail #bug #SNI #Apache

This is what happens at MokaFive if you break the build.

Installing multiple versions of VMware on Ubuntu with schroot

MokaFive's CCO (Chief Canine Officer), Emma.

#mokafive

The case of the mystery data blocks

Let me tell you the story of the trickiest bug I've ever encountered at MokaFive.

We were having a problem where our virtual disk would detect block corruption, where a block signature would not match. And this was not just any corruption. We saw partial encrypted blocks. We saw unencrypted data. We saw executable code and snippets of text files. In short, something really strange was going on.

It didn’t help that this problem was very hard to reproduce. We had thousands of users using the product day in and day out and would get a new report every week or so. So we scoured our code for race conditions, we added a ton of invariant checking in the code, we built and ran every stress test we could think of, we read every block immediately after it was written to verify it, ran scans on the machine to check for host filesystem corruption, checked and double-checked for use-after-free and resource bugs, and a bunch of other stuff. This went on for a few months.

Eventually, we noticed a trend. First off, this was only happening on Windows 7 machines. Not Windows XP, not Mac. Second, the corruption always happened on 4K boundaries. And finally, after getting a copy of one of these corrupted disks, we had the biggest smoking gun. Our tdsk file had 4K blocks of data from a completely unrelated file on the host. This was a file that our processes could have never read. Not only that, but we found 4K blocks in the middle of our file that contained NTFS directory listing structures for unrelated directories on the host. So what were they doing in the middle of our file?
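For readers who haven't worked with block signatures, here is a minimal sketch of the kind of check that flags a bad block and shows which 4K block it lives in. The post doesn't describe how the tdsk format actually signs blocks, so SHA-256 over fixed 4096-byte blocks is purely an assumption here, and the function names are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, matching the 4K boundaries in the story


def sign_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Return one SHA-256 digest per fixed-size block of data."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def find_corrupt_blocks(data: bytes, signatures: list,
                        block_size: int = BLOCK_SIZE) -> list:
    """Return the indexes of blocks whose digest no longer matches its signature."""
    current = sign_blocks(data, block_size)
    return [i for i, (have, want) in enumerate(zip(current, signatures))
            if have != want]


# Sign four empty blocks, then damage one byte inside the third block.
good = bytes(4 * BLOCK_SIZE)
signatures = sign_blocks(good)
bad = bytearray(good)
bad[2 * BLOCK_SIZE + 10] = 0xFF
print(find_corrupt_blocks(bytes(bad), signatures))
# prints [2]
```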

Eventually we pulled on that thread and after a ton of intimate time with ProcMon, we tracked it down to a bug in the Windows 7 kernel. More specifically, it was a race condition between NTFS and the Cache Manager. Essentially, a write issued (WriteFile, NtWriteFile, IRP_MJ_WRITE) for a file which has a pending flush (FlushFileBuffers, NtFlushBuffersFile, IRP_MJ_FLUSH_BUFFERS) can cause arbitrary host OS pages to be written to the target file in some circumstances.

The workaround was relatively easy but unfortunate – we now use a mutex to block writes to the file while a sync operation is occurring. This hurts concurrent performance but hey, data integrity is more important. We reported the bug to Microsoft and included a short program that easily reproduced the issue. Little did we know that the adventure was just beginning!
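The shape of that workaround can be sketched in a few lines. This is not MokaFive's code, just an illustration of the idea in Python: every write and every sync takes the same lock, so a write can never be issued while a flush is pending. The class name is hypothetical, and this simple version also serializes writes against each other, which is stricter than strictly necessary. On Windows, Python's os.fsync ends up calling the C runtime's _commit, which flushes the file's buffers to disk.

```python
import os
import threading


class SyncSafeFile:
    """Serialize writes and syncs on one file so they can never overlap."""

    def __init__(self, path: str):
        self._f = open(path, "w+b")      # creates or truncates the file
        self._lock = threading.Lock()

    def write_at(self, offset: int, data: bytes) -> None:
        with self._lock:                 # waits here while a sync is running
            self._f.seek(offset)
            self._f.write(data)

    def flush(self) -> None:
        with self._lock:                 # holds off new writes until the sync returns
            self._f.flush()              # push Python's buffer to the OS
            os.fsync(self._f.fileno())   # ask the OS to push it to disk

    def close(self) -> None:
        with self._lock:
            self._f.close()
```

The cost is exactly the one described above: while flush() holds the lock, writers sit idle instead of overlapping with the sync.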

A non-privileged process can cause arbitrary operating system pages to be written to a file that it controls. In our minds, this is a really bad bug. Unprivileged processes can get access to secret data and files on the same system that they normally would not have access to. Even beyond the security considerations, being able to successfully write data to a file and not have it be corrupted seemed to be a pretty fundamental and important role of an operating system kernel.

However, Microsoft didn’t seem to agree, or at least, didn’t seem to understand the issue. Their responses were shocking.

Priceless comments from the Microsoft rep:

MS rep: Why would anyone do something like that? (simultaneous sync and write). Nobody would/should do that, so it’s not a serious issue.

Our response: We have a good reason for doing it. In any case you guys are the operating system, you should work no matter how we call the standard APIs and not allow me to access the content of random files! That’s a security issue.

MS rep: It is not a security issue because anyone who has data worth securing will encrypt it or overwrite it with zeroes.

Our response: (speechless). Ummmm, what? That’s news to us and every other software developer out there. You mean to say that it is expected that programs shouldn’t rely on Windows to enforce file system protections or permissions? Besides, overwriting a file with zeroes won't necessarily overwrite the data on the disk. What if the file moved due to defrag? Then the old data won't be overwritten.

MS rep: Sounds like a minor corner case. We will get around to it when we get around to it.

Now, to give Microsoft credit, they did eventually fix this bug, although not directly due to us (apparently some Microsoft Live software hit the same issue so they decided to actually fix it). It seemed to have been a race condition that was introduced in the run-up towards the Windows 7 release, because it did not happen on XP or Vista. Now, we reported this bug in November 2009 and the fix eventually rolled out as part of Windows 7 Service Pack 1, released in February 2011. So it only took them 14 months to get a fix released, and now we don’t need that mutex anymore (at least on systems with SP1 installed). So 3 months for us to find, 1 day to implement a workaround, and 14 months to wait for the real fix. Priceless.

And that was the trickiest bug I’ve encountered at MokaFive.

#MokaFive #Windows #kernel #debugging #Microsoft #exploit #NTFS #filesystem #bug
