Algorithm to convert a fixed-length string to the smallest possible collision-free representation?
$begingroup$
I have a US-based telephone number (in the format 000-000-0000) and need to convert it to a "shorter" representation. Using base32/64 produces too long of a string. I've discovered CRC-16, which produces a string of length 4, however, I'm not sure it's a suitable tool for the job (I want to prevent collisions).
Basically, if the input is 000-000-0000 (without the dashes), I want the output to be something like 4xg4. In this use case, the advantage is that I have 10 digits, thus I could "convert" them to something smaller that uses letters as well.
Are you aware of any algorithm implementation for this?
algorithms data-compression base-conversion
New contributor
$endgroup$
add a comment |
$begingroup$
I have a US-based telephone number (in the format 000-000-0000) and need to convert it to a "shorter" representation. Using base32/64 produces too long of a string. I've discovered CRC-16, which produces a string of length 4, however, I'm not sure it's a suitable tool for the job (I want to prevent collisions).
Basically, if the input is 000-000-0000 (without the dashes), I want the output to be something like 4xg4. In this use case, the advantage is that I have 10 digits, thus I could "convert" them to something smaller that uses letters as well.
Are you aware of any algorithm implementation for this?
algorithms data-compression base-conversion
New contributor
$endgroup$
1
$begingroup$
Check out Huffman coding.
$endgroup$
– Pål GD
10 hours ago
1
$begingroup$
The answers below are mathematically correct. With the requirement of "no collisions" and no further information which could reduce your possible inputs, those are the best you can do. But if we understood why you needed a shorter representation, we might be able to offer other solutions. Disk space or memory seem unlikely constraints if you have the CPU to compress/decompress on the fly this kind of data. So is it some sort of UI constraint where the display needs to be shorter? Or something else?
$endgroup$
– JesseM
3 hours ago
add a comment |
$begingroup$
I have a US-based telephone number (in the format 000-000-0000) and need to convert it to a "shorter" representation. Using base32/64 produces too long of a string. I've discovered CRC-16, which produces a string of length 4, however, I'm not sure it's a suitable tool for the job (I want to prevent collisions).
Basically, if the input is 000-000-0000 (without the dashes), I want the output to be something like 4xg4. In this use case, the advantage is that I have 10 digits, thus I could "convert" them to something smaller that uses letters as well.
Are you aware of any algorithm implementation for this?
algorithms data-compression base-conversion
New contributor
$endgroup$
I have a US-based telephone number (in the format 000-000-0000) and need to convert it to a "shorter" representation. Using base32/64 produces too long of a string. I've discovered CRC-16, which produces a string of length 4, however, I'm not sure it's a suitable tool for the job (I want to prevent collisions).
Basically, if the input is 000-000-0000 (without the dashes), I want the output to be something like 4xg4. In this use case, the advantage is that I have 10 digits, thus I could "convert" them to something smaller that uses letters as well.
Are you aware of any algorithm implementation for this?
algorithms data-compression base-conversion
algorithms data-compression base-conversion
New contributor
New contributor
edited 11 hours ago
greybeard
4131314
4131314
New contributor
asked 12 hours ago
anemaria20anemaria20
1061
1061
New contributor
New contributor
1
$begingroup$
Check out Huffman coding.
$endgroup$
– Pål GD
10 hours ago
1
$begingroup$
The answers below are mathematically correct. With the requirement of "no collisions" and no further information which could reduce your possible inputs, those are the best you can do. But if we understood why you needed a shorter representation, we might be able to offer other solutions. Disk space or memory seem unlikely constraints if you have the CPU to compress/decompress on the fly this kind of data. So is it some sort of UI constraint where the display needs to be shorter? Or something else?
$endgroup$
– JesseM
3 hours ago
add a comment |
1
$begingroup$
Check out Huffman coding.
$endgroup$
– Pål GD
10 hours ago
1
$begingroup$
The answers below are mathematically correct. With the requirement of "no collisions" and no further information which could reduce your possible inputs, those are the best you can do. But if we understood why you needed a shorter representation, we might be able to offer other solutions. Disk space or memory seem unlikely constraints if you have the CPU to compress/decompress on the fly this kind of data. So is it some sort of UI constraint where the display needs to be shorter? Or something else?
$endgroup$
– JesseM
3 hours ago
1
1
$begingroup$
Check out Huffman coding.
$endgroup$
– Pål GD
10 hours ago
$begingroup$
Check out Huffman coding.
$endgroup$
– Pål GD
10 hours ago
1
1
$begingroup$
The answers below are mathematically correct. With the requirement of "no collisions" and no further information which could reduce your possible inputs, those are the best you can do. But if we understood why you needed a shorter representation, we might be able to offer other solutions. Disk space or memory seem unlikely constraints if you have the CPU to compress/decompress on the fly this kind of data. So is it some sort of UI constraint where the display needs to be shorter? Or something else?
$endgroup$
– JesseM
3 hours ago
$begingroup$
The answers below are mathematically correct. With the requirement of "no collisions" and no further information which could reduce your possible inputs, those are the best you can do. But if we understood why you needed a shorter representation, we might be able to offer other solutions. Disk space or memory seem unlikely constraints if you have the CPU to compress/decompress on the fly this kind of data. So is it some sort of UI constraint where the display needs to be shorter? Or something else?
$endgroup$
– JesseM
3 hours ago
add a comment |
2 Answers
2
active
oldest
votes
$begingroup$
You have 10 base10 digits. This is $10^{10}$ possible values.
Encoding this in an alphabet with 64 tokens you need $lceil log_{64}(10^{10}) rceil = 6$ characters.
If you encode it in straight binary you only need $lceil log_{2}(10^{10}) rceil = 34$ bits total. 4 bytes + 2 bits.
Unless there is a pattern in the number itself this is the best you can do.
Note that this looks at the number in a vacuum. If you encode more than just that number you can use arithmetic encoding to use some of the "spare room" the rounding creates.
$endgroup$
add a comment |
$begingroup$
As ratchet freak says, you have ten decimal digits, which should give $10^{10}$ possible values. But in practice, there are a few more restrictions. The format of a North American telephone number looks something like this:
[2-9][0-8]d - [2-9]dd - dddd
This gives 8*9*10 * 8*10*10 * 10*10*10*10 values. In addition, the fifth and sixth characters can't both be 1, which removes 1% of these. This means there are 5,702,400,000 possible telephone numbers. (Let's call this number $T$.)
If you want to encode these using an alphabet with $b$ different characters, you'll need a code $lceil log_b T rceil$ characters long. So if you want a code four characters long, you'll need $b geq 275$. If you want a code five characters long, you'll need $b geq 90$. And if you want a code six characters long, you'll need $b geq 43$.
Unfortunately, this is the best you can do, unless you have additional information about what numbers you'll be getting. This comes down to the pigeonhole principle, a remarkably straightforward theorem that's quite useful in CS. It's mathematically impossible to use fewer characters than this without ever having any collisions. You can make collisions extremely unlikely, but never fully prevent them.
(You can also make some of the codes be shorter without collisions, but as a consequence some of them will have to be longer. Again, pigeonhole principle.)
P.S. Note that, while base64 also produces a six-character string (since 64 is between 43 and 90), you can take advantage of the fact that you only need 43 to choose a "nicer" character set. For example, you can get rid of o
, O
, 0
, Q
, I
, l
, and 1
to cut down on confusion in writing. My suggestion would be 23456789ABCDEFGHJKLMNPRSTVXYZabdeghknpqrt-=
.
$endgroup$
$begingroup$
how does "the fifth and sixth characters can't both be 1" remove 8 possible numbers? shouldn't that be 1 out of every 100? So 5,702,400,000 total possible numbers.
$endgroup$
– ratchet freak
9 hours ago
$begingroup$
@ratchetfreak D'oh, of course, you're right. Let me adjust that.
$endgroup$
– Draconis
8 hours ago
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
});
});
}, "mathjax-editing");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "419"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
anemaria20 is a new contributor. Be nice, and check out our Code of Conduct.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcs.stackexchange.com%2fquestions%2f105488%2falgorithm-to-convert-a-fixed-length-string-to-the-smallest-possible-collision-fr%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
You have 10 base10 digits. This is $10^{10}$ possible values.
Encoding this in an alphabet with 64 tokens you need $lceil log_{64}(10^{10}) rceil = 6$ characters.
If you encode it in straight binary you only need $lceil log_{2}(10^{10}) rceil = 34$ bits total. 4 bytes + 2 bits.
Unless there is a pattern in the number itself this is the best you can do.
Note that this looks at the number in a vacuum. If you encode more than just that number you can use arithmetic encoding to use some of the "spare room" the rounding creates.
$endgroup$
add a comment |
$begingroup$
You have 10 base10 digits. This is $10^{10}$ possible values.
Encoding this in an alphabet with 64 tokens you need $lceil log_{64}(10^{10}) rceil = 6$ characters.
If you encode it in straight binary you only need $lceil log_{2}(10^{10}) rceil = 34$ bits total. 4 bytes + 2 bits.
Unless there is a pattern in the number itself this is the best you can do.
Note that this looks at the number in a vacuum. If you encode more than just that number you can use arithmetic encoding to use some of the "spare room" the rounding creates.
$endgroup$
add a comment |
$begingroup$
You have 10 base10 digits. This is $10^{10}$ possible values.
Encoding this in an alphabet with 64 tokens you need $lceil log_{64}(10^{10}) rceil = 6$ characters.
If you encode it in straight binary you only need $lceil log_{2}(10^{10}) rceil = 34$ bits total. 4 bytes + 2 bits.
Unless there is a pattern in the number itself this is the best you can do.
Note that this looks at the number in a vacuum. If you encode more than just that number you can use arithmetic encoding to use some of the "spare room" the rounding creates.
$endgroup$
You have 10 base10 digits. This is $10^{10}$ possible values.
Encoding this in an alphabet with 64 tokens you need $lceil log_{64}(10^{10}) rceil = 6$ characters.
If you encode it in straight binary you only need $lceil log_{2}(10^{10}) rceil = 34$ bits total. 4 bytes + 2 bits.
Unless there is a pattern in the number itself this is the best you can do.
Note that this looks at the number in a vacuum. If you encode more than just that number you can use arithmetic encoding to use some of the "spare room" the rounding creates.
answered 12 hours ago
ratchet freakratchet freak
2,86388
2,86388
add a comment |
add a comment |
$begingroup$
As ratchet freak says, you have ten decimal digits, which should give $10^{10}$ possible values. But in practice, there are a few more restrictions. The format of a North American telephone number looks something like this:
[2-9][0-8]d - [2-9]dd - dddd
This gives 8*9*10 * 8*10*10 * 10*10*10*10 values. In addition, the fifth and sixth characters can't both be 1, which removes 1% of these. This means there are 5,702,400,000 possible telephone numbers. (Let's call this number $T$.)
If you want to encode these using an alphabet with $b$ different characters, you'll need a code $lceil log_b T rceil$ characters long. So if you want a code four characters long, you'll need $b geq 275$. If you want a code five characters long, you'll need $b geq 90$. And if you want a code six characters long, you'll need $b geq 43$.
Unfortunately, this is the best you can do, unless you have additional information about what numbers you'll be getting. This comes down to the pigeonhole principle, a remarkably straightforward theorem that's quite useful in CS. It's mathematically impossible to use fewer characters than this without ever having any collisions. You can make collisions extremely unlikely, but never fully prevent them.
(You can also make some of the codes be shorter without collisions, but as a consequence some of them will have to be longer. Again, pigeonhole principle.)
P.S. Note that, while base64 also produces a six-character string (since 64 is between 43 and 90), you can take advantage of the fact that you only need 43 to choose a "nicer" character set. For example, you can get rid of o
, O
, 0
, Q
, I
, l
, and 1
to cut down on confusion in writing. My suggestion would be 23456789ABCDEFGHJKLMNPRSTVXYZabdeghknpqrt-=
.
$endgroup$
$begingroup$
how does "the fifth and sixth characters can't both be 1" remove 8 possible numbers? shouldn't that be 1 out of every 100? So 5,702,400,000 total possible numbers.
$endgroup$
– ratchet freak
9 hours ago
$begingroup$
@ratchetfreak D'oh, of course, you're right. Let me adjust that.
$endgroup$
– Draconis
8 hours ago
add a comment |
$begingroup$
As ratchet freak says, you have ten decimal digits, which should give $10^{10}$ possible values. But in practice, there are a few more restrictions. The format of a North American telephone number looks something like this:
[2-9][0-8]d - [2-9]dd - dddd
This gives 8*9*10 * 8*10*10 * 10*10*10*10 values. In addition, the fifth and sixth characters can't both be 1, which removes 1% of these. This means there are 5,702,400,000 possible telephone numbers. (Let's call this number $T$.)
If you want to encode these using an alphabet with $b$ different characters, you'll need a code $lceil log_b T rceil$ characters long. So if you want a code four characters long, you'll need $b geq 275$. If you want a code five characters long, you'll need $b geq 90$. And if you want a code six characters long, you'll need $b geq 43$.
Unfortunately, this is the best you can do, unless you have additional information about what numbers you'll be getting. This comes down to the pigeonhole principle, a remarkably straightforward theorem that's quite useful in CS. It's mathematically impossible to use fewer characters than this without ever having any collisions. You can make collisions extremely unlikely, but never fully prevent them.
(You can also make some of the codes be shorter without collisions, but as a consequence some of them will have to be longer. Again, pigeonhole principle.)
P.S. Note that, while base64 also produces a six-character string (since 64 is between 43 and 90), you can take advantage of the fact that you only need 43 to choose a "nicer" character set. For example, you can get rid of o
, O
, 0
, Q
, I
, l
, and 1
to cut down on confusion in writing. My suggestion would be 23456789ABCDEFGHJKLMNPRSTVXYZabdeghknpqrt-=
.
$endgroup$
$begingroup$
how does "the fifth and sixth characters can't both be 1" remove 8 possible numbers? shouldn't that be 1 out of every 100? So 5,702,400,000 total possible numbers.
$endgroup$
– ratchet freak
9 hours ago
$begingroup$
@ratchetfreak D'oh, of course, you're right. Let me adjust that.
$endgroup$
– Draconis
8 hours ago
add a comment |
$begingroup$
As ratchet freak says, you have ten decimal digits, which should give $10^{10}$ possible values. But in practice, there are a few more restrictions. The format of a North American telephone number looks something like this:
[2-9][0-8]d - [2-9]dd - dddd
This gives 8*9*10 * 8*10*10 * 10*10*10*10 values. In addition, the fifth and sixth characters can't both be 1, which removes 1% of these. This means there are 5,702,400,000 possible telephone numbers. (Let's call this number $T$.)
If you want to encode these using an alphabet with $b$ different characters, you'll need a code $lceil log_b T rceil$ characters long. So if you want a code four characters long, you'll need $b geq 275$. If you want a code five characters long, you'll need $b geq 90$. And if you want a code six characters long, you'll need $b geq 43$.
Unfortunately, this is the best you can do, unless you have additional information about what numbers you'll be getting. This comes down to the pigeonhole principle, a remarkably straightforward theorem that's quite useful in CS. It's mathematically impossible to use fewer characters than this without ever having any collisions. You can make collisions extremely unlikely, but never fully prevent them.
(You can also make some of the codes be shorter without collisions, but as a consequence some of them will have to be longer. Again, pigeonhole principle.)
P.S. Note that, while base64 also produces a six-character string (since 64 is between 43 and 90), you can take advantage of the fact that you only need 43 to choose a "nicer" character set. For example, you can get rid of o
, O
, 0
, Q
, I
, l
, and 1
to cut down on confusion in writing. My suggestion would be 23456789ABCDEFGHJKLMNPRSTVXYZabdeghknpqrt-=
.
$endgroup$
As ratchet freak says, you have ten decimal digits, which should give $10^{10}$ possible values. But in practice, there are a few more restrictions. The format of a North American telephone number looks something like this:
[2-9][0-8]d - [2-9]dd - dddd
This gives 8*9*10 * 8*10*10 * 10*10*10*10 values. In addition, the fifth and sixth characters can't both be 1, which removes 1% of these. This means there are 5,702,400,000 possible telephone numbers. (Let's call this number $T$.)
If you want to encode these using an alphabet with $b$ different characters, you'll need a code $lceil log_b T rceil$ characters long. So if you want a code four characters long, you'll need $b geq 275$. If you want a code five characters long, you'll need $b geq 90$. And if you want a code six characters long, you'll need $b geq 43$.
Unfortunately, this is the best you can do, unless you have additional information about what numbers you'll be getting. This comes down to the pigeonhole principle, a remarkably straightforward theorem that's quite useful in CS. It's mathematically impossible to use fewer characters than this without ever having any collisions. You can make collisions extremely unlikely, but never fully prevent them.
(You can also make some of the codes be shorter without collisions, but as a consequence some of them will have to be longer. Again, pigeonhole principle.)
P.S. Note that, while base64 also produces a six-character string (since 64 is between 43 and 90), you can take advantage of the fact that you only need 43 to choose a "nicer" character set. For example, you can get rid of o
, O
, 0
, Q
, I
, l
, and 1
to cut down on confusion in writing. My suggestion would be 23456789ABCDEFGHJKLMNPRSTVXYZabdeghknpqrt-=
.
edited 7 hours ago
answered 10 hours ago
DraconisDraconis
5,242820
5,242820
$begingroup$
how does "the fifth and sixth characters can't both be 1" remove 8 possible numbers? shouldn't that be 1 out of every 100? So 5,702,400,000 total possible numbers.
$endgroup$
– ratchet freak
9 hours ago
$begingroup$
@ratchetfreak D'oh, of course, you're right. Let me adjust that.
$endgroup$
– Draconis
8 hours ago
add a comment |
$begingroup$
how does "the fifth and sixth characters can't both be 1" remove 8 possible numbers? shouldn't that be 1 out of every 100? So 5,702,400,000 total possible numbers.
$endgroup$
– ratchet freak
9 hours ago
$begingroup$
@ratchetfreak D'oh, of course, you're right. Let me adjust that.
$endgroup$
– Draconis
8 hours ago
$begingroup$
how does "the fifth and sixth characters can't both be 1" remove 8 possible numbers? shouldn't that be 1 out of every 100? So 5,702,400,000 total possible numbers.
$endgroup$
– ratchet freak
9 hours ago
$begingroup$
how does "the fifth and sixth characters can't both be 1" remove 8 possible numbers? shouldn't that be 1 out of every 100? So 5,702,400,000 total possible numbers.
$endgroup$
– ratchet freak
9 hours ago
$begingroup$
@ratchetfreak D'oh, of course, you're right. Let me adjust that.
$endgroup$
– Draconis
8 hours ago
$begingroup$
@ratchetfreak D'oh, of course, you're right. Let me adjust that.
$endgroup$
– Draconis
8 hours ago
add a comment |
anemaria20 is a new contributor. Be nice, and check out our Code of Conduct.
anemaria20 is a new contributor. Be nice, and check out our Code of Conduct.
anemaria20 is a new contributor. Be nice, and check out our Code of Conduct.
anemaria20 is a new contributor. Be nice, and check out our Code of Conduct.
Thanks for contributing an answer to Computer Science Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcs.stackexchange.com%2fquestions%2f105488%2falgorithm-to-convert-a-fixed-length-string-to-the-smallest-possible-collision-fr%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
1
$begingroup$
Check out Huffman coding.
$endgroup$
– Pål GD
10 hours ago
1
$begingroup$
The answers below are mathematically correct. With the requirement of "no collisions" and no further information which could reduce your possible inputs, those are the best you can do. But if we understood why you needed a shorter representation, we might be able to offer other solutions. Disk space or memory seem unlikely constraints if you have the CPU to compress/decompress on the fly this kind of data. So is it some sort of UI constraint where the display needs to be shorter? Or something else?
$endgroup$
– JesseM
3 hours ago