Thursday, December 1, 2016

45.5.1 available, and 32-bit Intel Macs go Tier-3

Test builds for 45.5.1, with the single change being the safety fix for the Firefox 0-day in bug 1321066 (CVE-2016-9079), are now available. Release notes and hashes to follow when I'm back from my business trip late tonight. I will probably go live on this around the same time, so please test as soon as you can.

In other news, the announcement below was inevitable after Mozilla dropped support for 10.6 through 10.8, but for the record (from BDS):

As of Firefox 53, we are intending to switch Firefox on mac from a universal x86/x86-64 build to a single-architecture x86-64 build.

To simplify the build system and enable other optimizations, we are planning on removing support for universal mac build from the Mozilla build system.

The Mozilla build and test infrastructure will only be testing the x86-64 codepaths on mac. However, we are willing to keep the x86 build configuration supported as a community-supported (tier 3) build configuration, if there is somebody willing to step forward and volunteer as the maintainer for the port. The maintainer's responsibility is to periodically build the tree and make sure it continues to run.

Please contact me directly (not on the list) if you are interested in volunteering. If I do not hear from a volunteer by 23-December, the Mozilla project will consider the Mac-x86 build officially unmaintained.

The precipitating event for this is the end of NPAPI plugin support (see? TenFourFox was ahead of the curve!), except, annoyingly, Flash, with Firefox 52. The only major reason 32-bit Mac Firefox builds weren't ended with the removal of 10.6 support (10.6 being the last version of Mac OS X that could run on a 32-bit Intel Mac) was for those 64-bit Macs that had to run a 32-bit plugin. Since no plugins but Flash are supported anymore, and Flash has been 64-bit for some time, that's the end of that.

Currently we, as OS X/ppc, are a Tier-3 configuration also, at least for as long as we maintain source parity with 45ESR. Mozilla has generally been deferential to not intentionally breaking TenFourFox and the situation with 32-bit x86 would probably be easier than our situation. That said, candidly I can only think of two non-exclusive circumstances where maintaining the 32-bit Intel Mac build would be advantageous, and they're both bigger tasks than simply building the browser for 32 bits:

  • You still have to run a 32-bit plugin like Silverlight. In that case, you'd also need to undo the NPAPI plugin block (see bug 1269807) and everything underlying it.
  • You have to run Firefox on a 32-bit Mac. As a practical matter this would essentially mean maintaining support for 10.6 as well, roughly option 4 when we discussed this in a prior blog post with the added complexity of having to pull the legacy Snow Leopard support forward over a complete ESR cycle. This is non-trivial, but hey, we've done just that over six ESR cycles, although we had the advantage of being able to do so incrementally.

I'm happy to advise anyone who wants to take this on but it's not something you'll see coming from me. If you decide you'd like to try, contact Benjamin directly (his first name, smedbergs, us).

Tuesday, November 29, 2016

45.5.1 chemspill imminent

The plan was to get you a test build of TenFourFox 45.6.0 this weekend, but instead you're going to get a chemspill for 45.5.1 to fix an urgent 0-day exploit in Firefox which is already in the wild. Interestingly, the attack method is very similar to the one the FBI infamously used to deanonymise Tor users in 2013, which is a reminder that any backdoor the "good guys" can sneak through, the "bad guys" can too.

TenFourFox is technically vulnerable to the flaw, but the current implementation is x86-based and tries to attack a Windows DLL, so as written it will merely crash our PowerPC systems. In fact, without giving anything away about the underlying problem, our hybrid-endian JavaScript engine actually reduces our exposure surface further because even a PowerPC-specific exploit would require substantial modification to compromise TenFourFox in the same way. That said, we will still implement the temporary safety fix as well. The bug is a very old one, going back to at least Firefox 4.

Meanwhile, 45.6 is going to be scaled back a little. I was able to remove telemetry from the entire browser (along with its dependencies), and it certainly was snappier in some sections, but required wholesale changes to just about everything to dig it out and this is going to hurt keeping up with the ESR repository. Changes this extensive are also very likely to introduce subtle bugs. (A reminder that telemetry is disabled in TenFourFox, so your data is never transmitted, but it does accumulate internal counters and while it is rarely on a hot codepath there is still non-zero overhead having it around.) I still want to do this but probably after feature parity, so 45.6 has a smaller change where telemetry is instead only removed from user-facing chrome JavaScript. This doesn't help as much but it's a much less invasive change while we're still on source parity with 45ESR.

Also, tests with the "non-volatile" part of IonPower-NVLE showed that switching to all, or mostly, non-volatile registers in the JavaScript JIT compiler had no obvious impact on most benchmarks and occasionally was a small negative. Even changing the register allocator to simply favour non-volatile registers, without removing volatiles, had some small regressions. As it turns out, Ion actually looks pretty efficient with saving volatile registers prior to calls after all and the overhead of having to save non-volatile registers upon entry apparently overwhelms any tiny benefit of using them. However, as a holdover from my plans for NVLE, we've been saving three more non-volatile general purpose registers than we allow the allocator to use; since we're paying the overhead to use them already, I added those unused registers to the allocator and this got us around 1-2% benefit with no regression. That will ship with 45.6 and that's going to be the extent of the NVLE project.

On the plus side, however, 45.6 does have HiDPI support completely removed (because no 10.6-compatible system has a retina display, let alone any Power Mac), which makes the widget code substantially simpler in some sections, and has a couple other minor performance improvements, mostly to scrolling on image-heavy pages, and interface fixes. I also have primitive performance sampling working, which is useful because of a JavaScript interpreter infinite loop I discovered on a couple sites in the wild (and may be the cause of the over-recursion problems I've seen other places). Although it's likely Mozilla's bug and our JIT is not currently implicated, it's probably an endian issue since it doesn't occur on any Tier-1 platform; fortunately, the rough sampler I threw together was successfully able to get a sensible callstack that pointed to the actual problem, proving its functionality. We've been shipping this bug since at least TenFourFox 38, so if I don't have a fix in time it won't hold the release, but I want to resolve it as soon as possible to see if it fixes anything else. I'll talk about my adventures with the mysterious NSSampler in a future post soonish.

Watch for 45.5.1 over the weekend, and 45.6 beta probably next week.

Saturday, November 12, 2016

45.5.0 final available

The final release of TenFourFox 45.5.0 (downloads, hashish, er, hashes, release notes) is available. Pretty much everything made it, including the hybrid-endian JavaScript engine (the LE portion of IonPower-NVLE), the AltiVec VP9 IDCT/IADST/IHT transformations, the MP3 refactoring and the new custom in-browser prefpane. There is also a fix for PostScript-based front blocking which apparently glitched in 45. Assuming all goes well and there are no major regressions, this will go live either late Sunday or early Monday due to a planned power outage which will affect Floodgap on Tuesday.

Meanwhile, I still don't have a good understanding of what's wrong with Amazon Music (still works great in 38.10), nor the issue with some users being unable to make changes to their default search engine stick. This is the problem with a single developer, folks: what I can't replicate I can't repair. I have a couple other theories in that thread for people to respond to.

Next up will be actually ripping some code out for a change. I'm planning to completely eviscerate telemetry support since we have no infrastructure to manage it and it's wasted code, as well as retina Mac support, since no retina Mac can run 10.6. I don't anticipate these being major speed boosts but they'll help and they'll make the browser smaller. Since we don't have to maintain good compatibility with Mozilla source code anymore I have some additional freedom to do bigger surgeries like these. I'll also make a first cut at the non-volatile portion of IonPower-NVLE by making floating point registers in play non-volatile (except for the volatiles like f1 that the ABI requires to be live also); again, not a big boost, but it will definitely reduce stack pressure and should improve the performance of ABI-compliant calls. User agent switching and possibly some more AltiVec VP9 work are also on the table, but may not make 45.6.

The other thing that needs to be done is restoring our ability to do performance analysis because Shark and Sample on 10.4 freak out trying to resolve symbols from these much more recent gcc builds. The solution would seem to be a way to get program counter samples without resolving them, and then give that to a tool like addr2line or even gdb7 itself to do the symbol resolution instead, but I can't find a way to make either Shark or Sample not resolve symbols. Right now I'm disassembling /usr/bin/sample (since Apple apparently doesn't offer the source code for it) to see how it gets those samples and it seems to reference a mysterious NSSampler in the CHUD VM tools private framework. Magic Hat can dump the class but the trick is how to work with it and which selectors it will allow. More on that later.

Tuesday, November 8, 2016

Happy 6th birthday, TenFourFox

Today's the American election and no matter what, some of you are going to be delighted, some of you are going to be disappointed, and a few of you are going to be really steamed. But no matter what your perspective, we can all agree it's a good thing today is the sixth anniversary of TenFourFox's first beta release, 4.0b7, on the 8th of November 2010. Yes, we're six years old today! And by golly, we act like it!

Hail to the Chief!

Tuesday, November 1, 2016

Debian drops powerpc

This blog isn't generally or at least currently concerned with Linux/ppc happenings, primarily because that isn't my personal area of expertise and I use OS X/MacOS on almost all of my Power Macs, but this is rather major news that needs to be distributed a little more widely: Debian is dropping big-endian ppc and ppc64 in Stretch/Debian 9. The decision seems to be based on insufficient port maintainers. Because Ubuntu is based on Debian, the same should be expected in their next release, as well as any of the other Debian derivatives.

Power Architecture is not going away from Debian; they will still support little-endian 64-bit PowerPC, better known as ppc64el, and coincidentally the same architecture as the Raptor Talos which will run a wide choice of Linux distributions and probably some *BSDs. (It's up to $228K of $3.7M. I'm already in for a board and CPU. Back now!) And do note that Debian has stopped support for current architectures before like SPARC, so such a move is hardly unprecedented. But this is really bad news for our PowerPC friends in Amiga-land, and may seriously disturb the viability of boutique systems such as the AmigaOne X5000 -- AmigaOS is a lovely OS but it lacks the range of Linux on that hardware, and Debian was one of the lowest barriers to entry.

This doesn't mean some insane energetic freak like me couldn't take Debian/ppc and keep it rolling, but they'll have to step up now and a lot of work is in store. If you want a supported Linux on Power, however, you're gonna have to go POWER8. Otherwise, it might be a good time to check out NetBSD.

Thursday, October 27, 2016

Apple desktop users screwed again

Geez, Tim. You could have at least phoned in a refresh for the mini. Instead, we get a TV app and software function keys. Apple must be using the Mac Pro cases as actual trash cans by now.

Siri, is Phil Schiller insane?

That's a comfort.

(*Also, since my wife and I both own 11" MacBook Airs and like them as much as I can realistically like an Intel Mac, we'll mourn their passing.)

Thursday, October 20, 2016

We need more desktop processor branches

Ars Technica is reporting an interesting attack that uses a side-channel exploit in the Intel Haswell branch translation buffer, or BTB (kindly ignore all the political crap Ars has been posting lately; I'll probably not read any more articles of theirs until after the election). The idea is to break through ASLR, or address space layout randomization, to find pieces of code one can string together or directly attack for nefarious purposes. ASLR defeats a certain class of attacks that rely on the exact address of code in memory. With ASLR, an attacker can no longer count on code being in a constant location.

Intel processors since at least the Pentium use a relatively simple BTB to aid these computations when finding the target of a branch instruction. The buffer is essentially a dictionary with virtual addresses of recent branch instructions mapping to their predicted target: if the branch is taken, the chip has the new actual address right away, and time is saved. To save space and complexity, most processors that implement a BTB only do so for part of the address (or they hash the address), which reduces the overhead of maintaining the BTB but also means some addresses will map to the same index into the BTB and cause a collision. If the addresses collide, the processor will recover, but it will take more cycles to do so. This is the key to the side-channel attack.

(For the record, the G3 and the G4 use a BTIC instead, or a branch target instruction cache, where the table actually keeps two of the target instructions so it can be executing them while the rest of the branch target loads. The G4/7450 ("G4e") extends the BTIC to four instructions. This scheme is highly beneficial because these cached instructions essentially extend the processor's general purpose caches with needed instructions that are less likely to be evicted, but is more complex to manage. It is probably for this reason the BTIC was dropped in the G5 since the idea doesn't work well with the G5's instruction dispatch groups; the G5 uses a three-level hybrid predictor which is unlike either of these schemes. Most PowerPC implementations also have a return address stack for optimizing the blr instruction. With all of these unusual features Power ISA processors may be vulnerable to a similar timing attack but certainly not in the same way and probably not as predictably, especially on the G5 and later designs.)

To get around ASLR, an attacker needs to find out where the code block of interest actually got moved to in memory. Certain attributes make kernel ASLR (KASLR) an easier nut to crack. For performance reasons usually only part of the kernel address is randomized, in open-source operating systems this randomization scheme is often known, and the kernel is always loaded fully into physical memory and doesn't get swapped out. While the location it is loaded to is also randomized, the kernel is mapped into the address space of all processes, so if you can find its address in any process you've also found it in every process. Haswell makes this even easier because all of the bits the Linux kernel randomizes are covered by the low 30 bits of the virtual address Haswell uses in the BTB index, which covers the entire kernel address range and means any kernel branch address can be determined exactly. The attacker finds branch instructions in the kernel code such as by disassembling it that service a particular system call and computes (this is feasible due to the smaller search space) all the possible locations that branch could be at, creates a "spy" function with a branch instruction positioned to try to force a BTB collision by computing to the same BTB index, executes the system call, and then executes the spy function. If the spy process (which times itself) determines its branch took longer than an average branch, it logs a hit, and the delta between ordinary execution and a BTB collision is unambiguously high (see Figure 7 in the paper). Now that you have the address of that code block branch, you can deduce the address of the entire kernel code block (because it's generally in the same page of memory due to the typical granularity of the randomization scheme), and try to get at it or abuse it. The entire process can take just milliseconds on a current CPU.

The kernel is often specifically hardened against such attacks, however, and there are more tempting targets though they need more work. If you want to attack a user process (particularly one running as root, since that will have privileges you can subvert), you have to get your "spy" on the same virtual core as the victim process or otherwise they won't share a BTB -- in the case of the kernel, the system call always executes on the same virtual core via context switch, but that's not the case here. This requires manipulating the OS' process scheduler or running lots of spy processes, which slows the attack but is still feasible. Also, since you won't have a kernel system call to execute, you have to get the victim to do a particular task with a branch instruction, and that task needs to be something repeatable. Once this is done, however, the basic notion is the same. Even though only a limited number of ASLR bits can be recovered this way (remember that in Haswell's case, bit 30 and above are not used in the BTB, and full Linux ASLR uses bits 12 to 40, unlike the kernel), you can dramatically narrow the search space to the point where brute-force guessing may be possible. The whole process is certainly much more streamlined than earlier ASLR attacks which relied on fragile things like cache timing.

As it happens, software mitigations can blunt or possibly even completely eradicate this exploit. Brute-force guessing addresses in the kernel usually leads to a crash, so anything that forces the attacker to guess the address of a victim routine in the kernel will likely cause the exploit to fail catastrophically. Get a couple of those random address bits outside the 30 bits Haswell uses in the BTB table index and bingo, a relatively simple fix. One could also make ASLR more granular to occur at the function, basic block or even single instruction level rather than merely randomizing the starting address of segments within the address space, though this is much more complicated. However, hardware is needed to close the gap completely. A proper hardware solution would be to either use most or all of the virtual address in the BTB to reduce the possibility of a collision, and/or to add a random salt to whatever indexing or hashing function is used for BTB entries that varies from process to process so a collision becomes less predictable. Either needs a change from Intel.

This little fable should serve to remind us that monocultures are bad. This exploit in question is viable and potentially ugly but can be mitigated. That's not the point: the point is that the attack, particularly upon the kernel, is made more feasible by particular details of how Haswell chips handle branching. When everything gets funneled through the same design and engineering optics and ends up with the same implementation, if someone comes up with a simple, weapons-grade exploit for a flaw in that implementation that software can't mask, we're all hosed. This is another reason why we need an auditable, powerful alternative to x86/x86_64 on the desktop. And there's only one system in that class right now.

Okay, okay, I'll stop banging you over the head with this stuff. I've got a couple more bugs under investigation that will be fixed in 45.5.0, and if you're having the issue where TenFourFox is not remembering your search engine of choice, please post your country and operating system here.