Amazon AWS Certified Machine Learning Specialty – Exploratory Data Analysis

January 25, 2023

17. Binning, Transforming, Encoding, Scaling, and Shuffling

Let’s quickly go through some other techniques you might use in the process of feature engineering. One is called binning. The idea here is just to take your numerical data and transform it into categorical data by binning these values together based on ranges of values. So as an example, maybe I have the ages of people in my data set. I might put everyone in their 20s into one bucket, everyone in their 30s into another bucket, and so on and so forth.

That would be an example of binning, where I’m just putting everyone in a given range into a certain category. So instead of saying that I’m going to train based on the fact that you’re 22 and three months old, I’m just going to bucket you into the bin of 20-year-olds, right? So I’ve changed that number of 22-point-whatever into a category of 20-somethings. So that’s all binning is. Why would you want to do that? Well, there are a few reasons. One is that sometimes you have some uncertainty in your measurements.

So maybe your measurements aren’t exactly precise, and you’re not actually adding any information by saying this person is 22.37 years old versus 22.38 years old. Maybe some people remembered the wrong birthday, or you asked them on different days and you got different values as a result. So binning is a way of covering up imprecision in your measurements.
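The decade-bucket idea above could be sketched with pandas; the ages here are made-up values, just to show the mechanics:

```python
import pandas as pd

# Hypothetical ages, including imprecise fractional values
ages = pd.Series([22.37, 22.38, 31.0, 45.2, 67.9, 28.5])

# Bin into decades: everyone in their 20s, 30s, 40s, and so on
bins = [20, 30, 40, 50, 60, 70]
labels = ["20s", "30s", "40s", "50s", "60s"]
binned = pd.cut(ages, bins=bins, labels=labels, right=False)
print(binned.tolist())  # ['20s', '20s', '30s', '40s', '60s', '20s']
```

Notice that 22.37 and 22.38 land in the same bucket, which is exactly the point: the measurement noise disappears.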

That’s one reason. Another reason might be that you just really want to use a model that works on categorical data instead of numerical data. That’s kind of a questionable thing to be doing because you’re basically throwing some information away by binning, right? So if you’re doing that, you should think hard about why you’re doing that. The only really legitimate reason to do this is if there is uncertainty or errors in your actual underlying measurements that you’re trying to get rid of. There’s also something called quantile binning that you should understand. The nice thing about quantile binning is that it categorizes your data by their place in the data distribution.

So it ensures that every one of your bins has an equal number of samples within them. So with quantile binning, I make sure that I have my data distributed in such a way that I have the same number of samples in each resulting bin. Sometimes that’s a useful thing to do. So remember, quantile binning will have even sizes in each bin. Another thing we might do is transforming our data, applying some sort of a function to our features to make it better suited for our algorithms.
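Quantile binning can be sketched with pandas as well; the skewed incomes here are hypothetical:

```python
import pandas as pd

# Skewed hypothetical incomes; fixed-width bins would be very uneven
incomes = pd.Series([18000, 22000, 25000, 30000, 42000, 55000, 80000, 250000])

# Quantile binning: 4 bins, each ending up with the same number of samples
quartiles = pd.qcut(incomes, q=4, labels=["low", "mid-low", "mid-high", "high"])
print(quartiles.value_counts().tolist())  # [2, 2, 2, 2] -- even bin sizes
```

The bin edges adapt to the data distribution, so even the 250,000 outlier just shares the "high" bin with 80,000.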

So for example, if you have feature data that has an exponential trend within it, it might benefit from a logarithmic transform to make that data look more linear, which might help your model in actually finding real trends in it. Sometimes models have difficulty with nonlinear data coming into them. A real-world example is YouTube: they published a paper on how their recommendations work, which is great reading, by the way. There’s a reference to that in the slide here.

They have a whole section on feature engineering there that you might find useful. And one thing they do is, for any numeric feature x that they have (for example, how long has it been since you watched a video?), they also feed in the square of that and the square root of it. And the idea there is that they can learn superlinear and sublinear functions in the underlying data that way. So they’re not just throwing in raw values, they’re also throwing in the square and the square root, just to be careful and see if there actually are nonlinear trends there that they should be picking up on. They found that that actually improved their results. So that’s an example of transforming data. It’s not necessarily replacing data with a transformation; sometimes you’re actually creating a new feature from transforming an existing one. That’s what’s going on here. So they’re feeding in the original feature x, as well as x squared and the square root of x. You can see in this graph here why you might want to do that.
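That style of feature expansion could be sketched like this; the column name and values are hypothetical, just mirroring the "days since you watched a video" idea:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric feature: days since the user last watched a video
df = pd.DataFrame({"days_since_watch": [1.0, 4.0, 9.0, 16.0]})

# Keep the raw value, and also feed in its square and square root,
# so the model can pick up superlinear and sublinear trends
df["days_squared"] = df["days_since_watch"] ** 2
df["days_sqrt"] = np.sqrt(df["days_since_watch"])

print(df.columns.tolist())  # three features derived from one
```

The original column is kept alongside the transformed ones, matching the idea of creating new features rather than replacing the old one.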

So if I’m starting off with a function of x here on the green line, you can see that by taking the ln, the logarithm of that, I end up with a linear relationship instead, which might be easier for models to pick up on. I could also raise that to a higher power, which would actually make things worse in this case, but sometimes more data is better. Again, we’re talking about the curse of dimensionality, so there is a limit to that. But that’s what feature engineering is all about: trying to find that balance between having just enough information and too much information. Another very common thing you’ll do while preparing your data is encoding. And you see this a lot in the world of deep learning.

So a lot of times your model will require a very specific kind of input, and you have to transform your data and encode it into the format that your model requires. A very common example is called one-hot encoding. Okay? So make sure you understand how this works. The idea is that I create a bucket for every category that I have, and basically a one represents that a given category applies, and a zero represents that it’s not that category.

Let’s look at this picture as an example. Let’s say that I’m building a deep learning model that tries to do handwriting recognition on people drawing the numbers zero through nine. This is a very common example that we’ll look at more later. So to one-hot encode this information, I know that this thing represents the number eight. And to represent that in a one-hot encoded manner, basically I have ten different buckets for every possible digit that it might represent: 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9. Now, we usually start counting at zero here.

So you can see here that in the ninth slot, there’s a one that represents the number eight, and every other slot has a zero, representing that it is not that category. That’s all one-hot encoding is. So again, if I had a one in that first slot, that would represent the number zero. If I had a one in the second slot, that would represent the number one, and so on and so forth.
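The slot picture above could be sketched as a small helper; the function name is just for illustration:

```python
import numpy as np

# One-hot encode a digit label 0-9: a length-10 vector with a single 1
def one_hot(label, num_classes=10):
    vec = np.zeros(num_classes, dtype=int)
    vec[label] = 1  # the slot at index `label` turns on
    return vec

print(one_hot(8))  # [0 0 0 0 0 0 0 0 1 0] -- the ninth slot holds the 1
```

Each of those ten slots would then feed one input (or output) neuron of the network.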

We do this because in deep learning, neurons generally are either activated or they’re not. So I can’t just feed in the number eight or the number one into an input neuron and expect it to work. That’s not how these things operate. Instead, I need to have this one-hot encoding scheme where every single training label is actually going to be fed into ten different input neurons, where only one of them represents the actual category I have. So stare at that picture a little bit and make sure you understand it. If you’re not familiar with one-hot encoding, that is probably something you’ll see on the exam. We can also talk about scaling and normalizing your data.

Again, pretty much every model requires this as well. A lot of models prefer their feature data to be normally distributed around zero, and this is also true of most deep learning and neural networks. And at a minimum, most models will require that your feature data is at least scaled to comparable values. There are models out there that don’t care so much, such as decision trees, but most of them will be sensitive to the scale of your input data. Otherwise, if you have features with larger magnitudes, they’ll end up having more weight in your model than they should. Going back to the example of people: if I’m trying to train a system based on their income, which might be a very large number like 50,000, and also their age, which is a relatively small number like 30 or 40, and I weren’t normalizing that data down to comparable ranges before training on it, that income would have a much higher impact on the model than their ages. And that’s going to result in a model that doesn’t do a very good job.
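The income-versus-age example could be sketched with standardization, which centers each feature around zero; the numbers here are hypothetical:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on wildly different scales: income and age
X = np.array([[50000.0, 30.0],
              [80000.0, 45.0],
              [20000.0, 22.0],
              [65000.0, 38.0]])

# Standardize each column to mean 0 and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # both columns now centered near 0
print(X_scaled.std(axis=0))   # both columns now have std of 1
```

After scaling, income no longer dwarfs age just because its raw magnitude is larger.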

Now, it’s very easy to do this, especially with scikit-learn in Python. It has a preprocessing module that helps you out with this sort of thing. It has something called MinMaxScaler that will do it for you very easily. The only thing is, you have to remember to scale your results back up if what you’re predicting is not just categories but actual numeric data.
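A minimal sketch of MinMaxScaler, including scaling a prediction back up afterwards; the target values are hypothetical:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical numeric target values scaled down into [0, 1] for training
y = np.array([[100000.0], [250000.0], [400000.0]])

scaler = MinMaxScaler()
y_scaled = scaler.fit_transform(y)  # [[0.0], [0.5], [1.0]]

# A model trained on y_scaled predicts in [0, 1]; reverse the scaling
# to get a meaningful number back out
prediction_scaled = np.array([[0.5]])
prediction = scaler.inverse_transform(prediction_scaled)
print(prediction)  # [[250000.]]
```

Keeping the fitted scaler around is what makes the `inverse_transform` step possible at prediction time.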

So sometimes, if you’re predicting something, you have to make sure to reapply that scaling in reverse to actually get a meaningful result out of your model at the end of the day. Finally, we will talk about shuffling. A lot of algorithms benefit from shuffling your training data. Otherwise, sometimes there’s sort of a residual signal in your training data resulting from the order in which that data was collected. So you want to make sure you’re eliminating any byproducts of how the data was actually collected by shuffling it and just randomizing the order that’s fed into your model. Often that makes a difference in quality as well. There are a lot of stories I’ve seen where someone got a really bad result out of their machine learning model, but by just shuffling the input, things got a lot better. So don’t forget to do that as well. And that’s the world of feature engineering in a nutshell.
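The shuffling step above could be sketched with scikit-learn’s utility, which keeps features and labels aligned while randomizing row order; the data here is made up:

```python
import numpy as np
from sklearn.utils import shuffle

# Hypothetical data collected in order (e.g., sorted by date),
# so the labels arrive in runs: all 0s first, then all 1s
X = np.arange(10).reshape(5, 2)
y = np.array([0, 0, 0, 1, 1])

# Shuffle features and labels together, keeping rows paired up
X_shuf, y_shuf = shuffle(X, y, random_state=42)
print(y_shuf)  # same labels, randomized order
```

Shuffling X and y in a single call is important: shuffling them independently would break the pairing between each row and its label.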

