Why do data scientists use Sklearn's StandardScaler and what does it do?
Been doing some Machine Learning “learning” in the past two weeks. I’m interested in learning to use TensorFlow.js at a decent level and solve a personal problem: software estimation and road-mapping for the teams I work with. This has led me down the darkest alleys of Machine Learning blogs.
Most of the relevant material I found uses TensorFlow but with Python. I understand why data scientists choose Python over other languages. Simple syntax, easy to grasp, amazing array/collection operations; those list comprehensions are gold when it comes to making sense of data and transforming it in one line of code.
In any case, I found it hard to understand why almost everyone used sklearn.preprocessing.StandardScaler before any ML smarts, until I finally decided to look up the documentation.
The gist
It turns out that standardizing your data is much more than making sure the number of items in each row of your CSV matches the number of labels (headers) the CSV has. As the documentation states, standardization “is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data…”.
This means that before you start training or predicting on your dataset, you first need to tame the “oddballs”: features that aren’t centered around 0 or that sit on wildly different scales, because they might throw off the learning your algorithm is doing. That’s exactly what StandardScaler does: it removes each feature’s mean and scales it to unit variance.
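Here’s a minimal sketch of what that looks like in practice, assuming scikit-learn and NumPy are installed; the feature values (estimated hours and team size) are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up features on very different scales:
# column 0 = estimated hours, column 1 = team size.
X = np.array([
    [120.0, 3.0],
    [400.0, 5.0],
    [ 80.0, 2.0],
    [950.0, 8.0],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each column is now z = (x - mean) / std for that column.
print(scaler.mean_)            # per-feature means learned from X
print(X_scaled.mean(axis=0))   # ~0 for every feature
print(X_scaled.std(axis=0))    # ~1 for every feature
```

Under the hood it’s just the z-score formula applied per column, which is also what you’d reproduce by hand if no JavaScript equivalent turns up.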
Need to find something that does this in JavaScript. I bet it exists.