Hey there. I have machine vision and image processing experience. Some relevant projects include vision-based tide and wave extrapolation, autonomous camera network calibration for 3D scene capture, vision-based avionics, geolocation via skyline and horizon recognition, and cellular microscopy recognition in low signal-to-noise-ratio images. It's been years, but I have the background.
What you're describing actually is possible using current techniques and equipment, but it won't work as you've described it. Let's go on a quick mental adventure.
There is no reason to build a wireframe; that's a human-ism, probably in line with what you'd see in a movie. A computer does not know or care about wireframes, and neither should you. Your goal is to automatically grade the quality of the swimmer's technique.
A computer cares about data and algorithms. Feed the computer a bunch of data, run it through an algorithm, and it gives you a result. The result could be a simple "yes" or "no," but it could even be another algorithm altogether. Those results could be combined using another algorithm to get another result. And so on. Basically, the point is, computers are very good at taking in a bunch of data, then crunching it.
Now put yourself in that mentality. From that perspective, a camera is not a visual device; it's a two-dimensional data capture device. A video camera is a three-dimensional data capture device (X, Y, and time). Now think about 1080p video at 30 frames per second for 10 seconds; that's over 600 million pixels of information, each of which has 6 dimensions (red, green, blue, X coordinate, Y coordinate, time). Tons of data!
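If you want to check that number yourself, here is the back-of-the-envelope arithmetic as a few lines of Python. The only assumption is that "1080p" means a 1920 x 1080 frame.

```python
# Rough size of a 10-second 1080p clip at 30 frames per second.
WIDTH, HEIGHT = 1920, 1080   # assumed 1080p frame size in pixels
FPS = 30
SECONDS = 10

frames = FPS * SECONDS               # number of frames in the clip
pixels = WIDTH * HEIGHT * frames     # pixel samples across the whole clip
values = pixels * 3                  # one red, green, and blue value per pixel

print(f"{frames} frames, {pixels:,} pixels, {values:,} colour values")
```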
Alone, the data is meaningless. To teach the computer to "see" points of interest, we need to teach it to identify patterns in the data. Various algorithms are used to represent the same data in a different way. For example, I could describe a picture of a swimmer in a pool by listing out every single pixel, or I could say "dark blob inside of a light blue blob." Of course, the latter option makes us lose a ton of data, but it still conveys information. Information and data are not the same. There are literally thousands of features we could calculate for a single frame of video, ranging from "there is a lot of blue" to a statistical distribution of a histogram of gradients. By this point, we've gone from having an insurmountable amount of useless data to an insurmountable amount of potentially useful information. There's a big problem, though. If you asked the computer to tell you how the swimmer is doing at this stage, it would give you over a thousand seemingly random descriptions. They would all be true, but you would have no realistic way of knowing which ones actually matter.
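To make "data in, information out" a bit more concrete, here is a minimal sketch using NumPy. The function name `extract_features`, the synthetic frame, and the particular features are all made up for illustration; the gradient histogram is a crude stand-in, not a full histogram-of-gradients descriptor.

```python
import numpy as np

def extract_features(frame):
    """Turn an H x W x 3 RGB frame (values 0-255) into a short feature vector."""
    rgb = frame.astype(np.float64) / 255.0
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]

    # "There is a lot of blue": share of pixels where blue is the strongest channel.
    blue_fraction = np.mean((b > r) & (b > g))

    # "Dark blob": share of pixels much darker than the frame average.
    gray = rgb.mean(axis=2)
    dark_fraction = np.mean(gray < 0.5 * gray.mean())

    # Crude edge summary: 8-bin histogram of gradient magnitudes, normalised to sum to 1.
    gy, gx = np.gradient(gray)
    magnitude = np.hypot(gx, gy)
    hist, _ = np.histogram(magnitude, bins=8, range=(0.0, 1.0))
    hist = hist / hist.sum()

    return np.concatenate(([blue_fraction, dark_fraction], hist))

# Synthetic stand-in for a pool frame: light blue water with a dark rectangle in it.
frame = np.zeros((120, 160, 3), dtype=np.uint8)
frame[:] = (120, 200, 255)
frame[50:70, 60:100] = (30, 30, 30)

features = extract_features(frame)
print(features.shape, features.round(3))
```

That turns 57,600 raw numbers into 10. A real system would compute far more features than this, which is exactly where the next problem comes from.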
Each piece of extracted information is called a feature. The collection of all features is called a feature vector. The dimensionality of the feature vector is literally just the number of features. What we need to do is shrink the thousands of dimensions down to just two or three. This process is a combination of three major areas: feature selection, dimensionality reduction, and modeling. The first one is dead simple: figure out which features you want to ignore or remove outright. The last one is not so simple, but pretty straightforward: come up with a formula that takes in the features and gives you a quality score for the swimmer's technique. But what about that ridiculously pompous-sounding one in the middle?
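As a rough illustration of the "dead simple" step, here is one possible feature selection rule: throw away any feature that never changes from frame to frame. This is just one strategy I picked for the sketch; `select_features` and the toy data are hypothetical.

```python
import numpy as np

def select_features(X, min_variance=1e-6):
    """Keep only the columns of X (frames x features) whose variance exceeds min_variance."""
    keep = X.var(axis=0) > min_variance
    return X[:, keep], keep

rng = np.random.default_rng(0)
n_frames = 50
X = np.column_stack([
    rng.normal(size=n_frames),   # changes from frame to frame: keep
    np.full(n_frames, 0.96),     # constant "lots of blue": tells us nothing
    rng.normal(size=n_frames),   # changes from frame to frame: keep
    np.zeros(n_frames),          # always zero: drop
])

X_selected, keep = select_features(X)
print(keep, X_selected.shape)
```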
Dimensionality reduction is the key to solving low SNR (signal-to-noise ratio, or "quality") machine vision problems. A swimmer moving through water, who may be fully or partially submerged and lit under all sorts of varying conditions, definitely makes for a low-quality image. By contrast, you could say that a scanned text document is a high-SNR image, since there are basically just two colors and no variance in lighting. So how does it work and how do we do it?
Think about a cube. That's easy to do, as it's only 3 dimensions. Now think about a 1000-dimension object. It ties your brain in a knot, right? But it's actually fairly simple in concept (in practice, you'd better have deep pockets for consulting fees). Draw a cube on a piece of paper. You will notice that despite the picture being only two dimensions, you can comprehend a three-dimensional object. Another way of stating that is that you've represented 3-dimensional data with 2-dimensional information. Notice the distinction between data and information, again. So if you can represent 3D data with 2D information, couldn't you represent 4D data with 3D information? And so on?
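The cube-on-paper trick is just a matrix multiplication. Here is a small sketch, assuming the usual "skew the depth axis at 45 degrees and halve it" drawing style (a cabinet projection); the variable names are mine.

```python
import itertools
import numpy as np

# The 8 corners of a unit cube: 3-dimensional data.
corners = np.array(list(itertools.product([0.0, 1.0], repeat=3)))  # shape (8, 3)

# A 2 x 3 matrix that folds depth (z) into x and y at a 45 degree angle,
# at half length, the way a cube is often sketched on paper.
angle = np.deg2rad(45)
projection = np.array([
    [1.0, 0.0, 0.5 * np.cos(angle)],
    [0.0, 1.0, 0.5 * np.sin(angle)],
])

drawing = corners @ projection.T   # shape (8, 2): the cube "on paper"
print(drawing.round(3))
```

All eight corners still land on distinct spots on the page, which is why you can still read the drawing as a cube even though a whole dimension is gone.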
The answer is yes. You can represent that 1000-dimension feature vector with 999 dimensions, usually losing very little information along the way. You can keep going all the way down to 2 or 3 if you want to. You would be amazed at how we can represent the scene. Expand your definition of the real world for a moment. Is the swimmer moving within the water, or is he using the water to move the pool around him? They're two different ways of explaining the same thing. But since we don't care about the pool and only care about the swimmer, we can use the latter representation. By doing this, we eliminate the dimension of "position", since from this perspective the swimmer suddenly stops moving and the idea of a position is no longer relevant. Keep up this mentality, except that you need to do it mathematically.
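Here is a minimal sketch of doing that mathematically. It assumes you already have a few tracked points on the swimmer per frame (the data below is synthetic), removes position by subtracting each frame's centroid, and then uses principal component analysis via SVD as the reduction step. PCA is my pick for the illustration; it is one common technique among many.

```python
import numpy as np

def remove_position(points):
    """points: (frames, joints, 2) pixel coordinates.
    Subtract each frame's centroid so the swimmer stays put and the pool moves."""
    centroid = points.mean(axis=1, keepdims=True)
    return points - centroid

def pca_reduce(X, n_components):
    """Project the rows of X (frames x features) onto the top principal components."""
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return centered @ vt[:n_components].T, explained[:n_components]

rng = np.random.default_rng(0)
frames, joints = 200, 5
t = np.linspace(0, 4 * np.pi, frames)

# Hypothetical tracked points: a fixed body layout, a periodic stroke motion,
# steady travel down the lane, and a little tracking noise.
layout = rng.normal(size=(joints, 2))
stroke = np.sin(t)[:, None, None] * rng.normal(size=(joints, 2))
travel = np.stack([np.linspace(0, 500, frames), np.zeros(frames)], axis=1)[:, None, :]
points = layout + stroke + travel + 0.01 * rng.normal(size=(frames, joints, 2))

stationary = remove_position(points)
X = stationary.reshape(frames, -1)   # 10 features per frame
reduced, explained = pca_reduce(X, n_components=2)
print(reduced.shape, explained.round(3))
```

In this toy setup nearly all of the variation ends up in the first component, because the only thing left once position is gone is the stroke itself.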
Eventually, you will arrive at a smaller, manageable feature vector. You could then plug those numbers into a formula that gives you a quality rating for the technique of the swimmer. Figuring out this formula is called modeling. The short version is that we have known shapes, known formulas for those shapes, and a history of known swimmer data. Now find the shape that most closely matches your historic swimmer data, and there's your model. Plug the numbers in and it'll tell you how close it is to the theoretically perfect shape. The closer it is, the better the swimmer's technique.
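A toy version of that modeling step might look like the sketch below. Everything here is assumed for illustration: the two reduced features `u` and `v`, polynomials standing in for the "known shapes", and the synthetic history of good swimmers. Each candidate shape is fitted on half the history and judged on the other half, so a more flexible shape does not win just by being more flexible.

```python
import numpy as np

def fit_best_shape(u, v, degrees=(1, 2, 3)):
    """Fit polynomial 'shapes' v = p(u) on half the data and keep the one
    with the lowest mean squared error on the other half."""
    train, held_out = slice(0, None, 2), slice(1, None, 2)
    best = None
    for degree in degrees:
        coeffs = np.polyfit(u[train], v[train], degree)
        error = np.mean((np.polyval(coeffs, u[held_out]) - v[held_out]) ** 2)
        if best is None or error < best[1]:
            best = (coeffs, error)
    return best[0]

def technique_score(coeffs, u, v):
    """Root-mean-square distance from the fitted shape; lower means closer to the model."""
    return float(np.sqrt(np.mean((np.polyval(coeffs, u) - v) ** 2)))

rng = np.random.default_rng(0)

# Synthetic history of good swimmers: points near the curve v = 1 - u^2.
u_hist = rng.uniform(-1, 1, size=100)
v_hist = 1 - u_hist**2 + 0.05 * rng.normal(size=100)
model = fit_best_shape(u_hist, v_hist)

# Two new swimmers: one close to the ideal curve, one all over the place.
u_new = np.linspace(-1, 1, 50)
clean = technique_score(model, u_new, 1 - u_new**2 + 0.05 * rng.normal(size=50))
sloppy = technique_score(model, u_new, 1 - u_new**2 + 0.5 * rng.normal(size=50))
print(f"clean stroke: {clean:.3f}, sloppy stroke: {sloppy:.3f}")
```

The clean stroke comes out with the smaller distance, which is the "closer to the perfect shape means better technique" idea in miniature.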
Recap:
Collect data (video), process it into information, reduce the information to the most important components, and fit it to a model.
There you go! You have an automated swim coach!
Easy, right?