How to give a higher importance to certain features in a (k-means) clustering model?
I am clustering data with numeric and categorical variables. To feed the categorical variables into the clustering model, I create dummy variables. However, I suspect this gives those variables a higher importance, because multiple dummy variables represent a single categorical variable.
For example, a categorical variable Airport expands into four dummy variables: LAX, JFK, MIA and BOS. Suppose I also have a numeric Temperature variable, and I scale all variables to lie between 0 and 1. The Airport variable now seems four times as important as the Temperature variable, and the clusters end up based mostly on Airport.
I want all variables to have the same importance. Is there a way to achieve this? I was thinking of scaling the variables differently, but I don't know what scaling would give them equal importance.
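The imbalance can be seen numerically, along with one possible reweighting. This is only a sketch: the 1/sqrt(2) factor is an illustrative choice (it makes a full category mismatch contribute the same squared distance as the maximum temperature difference), and the values are made up.

```python
import numpy as np

# Toy data: airport one-hot (LAX, JFK, MIA, BOS) plus temperature scaled to [0, 1].
# Values are hypothetical, for illustration only.
X_airport = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
], dtype=float)
X_temp = np.array([[0.2], [0.9]])

# A mismatch in airport flips two dummy columns, contributing 1 + 1 = 2 to the
# squared Euclidean distance, while temperature contributes at most 1.
# Scaling the dummy block by 1/sqrt(2) makes a full category mismatch
# contribute exactly 1, matching the maximum temperature contribution.
w = 1.0 / np.sqrt(2.0)
X = np.hstack([w * X_airport, X_temp])

d_airport = np.sum((X[0, :4] - X[1, :4]) ** 2)   # 2 * w**2 = 1.0
d_temp = np.sum((X[0, 4:] - X[1, 4:]) ** 2)      # (0.2 - 0.9)**2 = 0.49
print(d_airport, d_temp)
```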
machine-learning clustering feature-scaling dummy-variables
asked yesterday
Eva
3 Answers
You cannot really use k-means clustering if your data contains categorical variables, since k-means relies on Euclidean distance, which does not make much sense for dummy variables. Check out the answers to this similar question.
I would suggest you switch to k-modes as your clustering algorithm. You will find good implementations for both Python and R.
answered yesterday, edited yesterday, by georg_un
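The k-modes idea (Hamming distance between rows, column-wise modes as centroids) can be sketched in plain NumPy. This is a toy illustration with a simplistic fixed initialization and made-up integer-coded data; for real work, the `kmodes` Python package provides proper KModes and KPrototypes implementations.

```python
import numpy as np

def kmodes(X, init_idx, n_iter=10):
    """Minimal k-modes sketch: Hamming distance + per-column modes.

    X        : integer-coded categorical data, shape (n_samples, n_features)
    init_idx : row indices used as initial centroids (simplistic init)
    """
    centroids = X[init_idx].copy()
    k = len(init_idx)
    for _ in range(n_iter):
        # Assign each row to the centroid with the fewest mismatching columns.
        dists = (X[:, None, :] != centroids[None, :, :]).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update each centroid to the column-wise mode of its cluster members.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = [np.bincount(col).argmax() for col in members.T]
    return labels, centroids

# Hypothetical data: columns are integer codes for [airport, carrier].
X = np.array([[0, 0], [0, 1], [0, 0], [3, 2], [3, 2], [3, 1]])
labels, _ = kmodes(X, init_idx=[0, 3])
print(labels)
```

With this initialization the first three rows (airport 0) and the last three rows (airport 3) end up in separate clusters.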
The k-means objective is a sum of squared differences over the features.
So if you want to increase the importance of a feature, scale it accordingly: if you multiply it by 2, its squared differences grow by a factor of 4, so you have increased its weight.
However, I would just not use k-means on one-hot variables. The mean is meant for continuous variables; minimizing the sum of squares on a one-hot variable has weird semantics.
answered yesterday by Anony-Mousse
You cannot use the k-means algorithm if your data contains categorical variables; k-modes is suitable for clustering categorical data. However, there are several algorithms for clustering mixed data, which are essentially variations or modifications of the basic ones.
Please check the following paper:
"Survey of State-of-the-Art Mixed Data Clustering Algorithms", Amir Ahmad and Shehroz Khan, 2019.
answered yesterday by Christos Karatsalos