DISCLAIMER: i'm just a lowly CompSci graduate, please don't flame me
a single register should be the size of a "Word". if i remember my "intro to EE" class correctly then today it would typically be 64 bits wide.
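if you want a rough check on your own machine, here's a minimal sketch in python. note the assumptions: it reports the pointer width your python build was compiled for, which usually (but not always) matches the width of the general-purpose registers. the function name pointer_width_bits is just something i made up for illustration.

```python
import platform
import struct


def pointer_width_bits():
    # "P" is the struct format code for a C void*;
    # calcsize returns its size in bytes, so multiply by 8 for bits.
    return struct.calcsize("P") * 8


# prints the machine type reported by the OS and the pointer width in bits
print(platform.machine(), pointer_width_bits())
```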
a register file holds many registers (i have no idea how many, though), and the L0$ would have many register files.
registers are the fastest type of storage on the CPU, and therefore cost the most.
that's why you have several layers of "storage" in a computer, ranked according to "access time" from fastest to slowest:
registers, cache, RAM, HDD, physical media, networks, etc.
the problem with the top 3 types of storage i mentioned is that they are volatile,
meaning they lose their data once the power is cut (i.e. you turn off the computer). so RAM, for instance, will never replace HDDs no matter how much RAM we have, unless you want to go back to the 640k era where you had to reload everything into memory every time you turned on the computer.
but if we had Memristor RAM....*droooool*....(Wiki it)
as for your second question: before an ALU can perform an operation on data from the CPU bus, that data must first be copied from the bus into the ALU's operand register(s).
so an ADD operation would look something like this:
1) take the bits from the A operand
2) add them to the bits in the B operand, carrying over from each bit position to the next.
3) store the result in the C (or output) operand.
the C register is then "released" (dumped) onto the bus for the next operation.
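to make the steps above concrete, here's a toy sketch in python. it is not how any real ALU is wired: the names ToyALU, ripple_add, load and release are all made up for illustration, the 64-bit word size is an assumption, and real hardware uses faster adder designs than this bit-by-bit ripple carry.

```python
WORD_BITS = 64                 # assumed word size
MASK = (1 << WORD_BITS) - 1    # keeps values inside one word


def ripple_add(a, b, width=WORD_BITS):
    """Add two width-bit values one bit at a time; returns (result, carry_out)."""
    result = 0
    carry = 0
    for i in range(width):
        bit_a = (a >> i) & 1
        bit_b = (b >> i) & 1
        total = bit_a + bit_b + carry   # 0, 1, 2 or 3
        result |= (total & 1) << i      # low bit is the sum bit for position i
        carry = total >> 1              # high bit carries into position i + 1
    return result, carry


class ToyALU:
    """Hypothetical ALU with two operand registers (a, b) and an output register (c)."""

    def __init__(self):
        self.a = self.b = self.c = 0
        self.carry = 0

    def load(self, a, b):
        # copy the operands "from the bus" into the operand registers
        self.a = a & MASK
        self.b = b & MASK

    def add(self):
        # steps 1-3: read A and B, add with carry, store the result in C
        self.c, self.carry = ripple_add(self.a, self.b)

    def release(self):
        # put the contents of C back "onto the bus"
        return self.c


alu = ToyALU()
alu.load(2, 3)
alu.add()
print(alu.release())  # prints 5
```

if the sum doesn't fit in one word, the result wraps around and the leftover ends up in the carry flag, which is why the sketch keeps carry as a separate field.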
does this answer your questions?