TSX, sure, but AVX isn't about multithreading; it's about widening vectors within a single thread - which is, frankly, a lot trickier to squeeze performance out of. Ganging together more and more operations to execute simultaneously is a LOT harder than just having multiple independent threads in flight.
That's true; nevertheless, people manage. They switched from MMX/iSSE to SSE2, GPGPU has been successfully applied to many problem domains, and from initial reports Xeon Phi was well received by at least some HPC programmers.
Of course you often lose efficiency when porting to a wider SIMD architecture, but that can't be helped. With efficient data-rearrangement instructions (powerful permutes, gather) you can reduce the inefficiencies. That's different from the early days of SSE2, when shuffle instructions were extremely slow due to the lack of a full-width crossbar. And gather should actually allow vectorization of some algorithms that were not practically vectorizable before.
I kind of wish Intel would do something like what AMD has done - let their vector units execute either a single 256-bit op per cycle or two 128-bit ops per cycle. It gets even more mental when you go to the Xeon Phi - 512-bit vectors!
This approach doesn't help if you want to increase performance. Let's look at a hypothetical Haswell which aims for the same vector throughput as the actual Haswell, but has 128-bit execution pipes.
You need the following pipes: 4xFMA, 2xVecMisc, 4xLoad, 2xStoreData, 2xStoreAddress = 14 pipes in total. The design needs to be 8-wide to feed the pipes. But of course you not only need an 8-wide x86 decoder and the 14-port backend, you also need to enlarge the instruction cache, fetch bandwidth, instruction decode queue, reorder buffer, reservation station, data cache ports, PRF ports and the result forwarding network. With such a wide design you'd probably have to move to clustered execution and add a 2nd PRF - and implement 4-way SMT in order to justify the other expenses. You'd need a deeper pipeline. These changes would increase instruction latencies, probably by at least 2 cycles. Transistor count per core would go up by 25-40%, and you'd lose maybe 30% clock speed at the same TDP due to the complexity of the chip. Oh, and development/validation costs would double or triple. Also, this chip wouldn't hit the market until 2015, again due to the complexity.
And when you then run your 128-bit code on this chip, you'll find that most programs run no faster than before, or even slower, because you're now running into ILP limits on top of the higher latencies and lower clocks. All this on a CPU that's twice as expensive.
Of course you could do a design that's "only" 6-wide with something like 10-12 execution ports, but such a design would have an unbalanced ratio of execution to load/store ports; you'd run into bottlenecks and could never fully utilize your execution resources, yet you'd still pay a high price for the additional complexity.
So if you consider all the related problems, Intel's current designs are probably pretty close to optimal vector performance for a mainstream OoO x86 design, given the current transistor budget and process limitations. We'll just have to deal with the design-time DLP problems, because there simply is no other practical solution.