How to convert a text file with mixture of encodings to a single encoding?
I created a text file by copying parts from different sources (web pages, other text files, PDF files) into gedit and saving the result. I guess that is why the file contains multiple encodings, but I am not sure. How can I avoid creating a text file with mixed encodings when copying parts from different sources into gedit?
Whenever I open the file in gedit, it shows (decodes) every part of the text correctly. It seems that gedit can handle a text file with mixed encodings, but I am not sure.
But when I open the file in Emacs, some characters are not shown correctly. (I am not sure why Emacs can't do that.) So I would like to convert the file from mixed encodings to a single encoding such as UTF-8.
Since I think gedit can detect the correct encodings for the different parts of the file, and I don't know whether other applications can do so, is it possible to ask gedit to convert the file to UTF-8, or at least to tell me which encoding it finds for which part of the file?
Thanks.
gedit emacs encoding
When you click File > Save As, you should see two options at the bottom of the window: one for character encoding and one for line endings.
– jeremija
Sep 27 '14 at 17:37
Is that the encoding which gedit used for opening the text file?
– Tim
Sep 27 '14 at 17:39
Most probably it is.
– jeremija
Sep 27 '14 at 17:41
Is that also the encoding which gedit guessed for the text file?
– Tim
Sep 27 '14 at 17:42
I guess so. When you open a file you can also choose an encoding to use, or you can let gedit auto-detect the encoding.
– jeremija
Sep 27 '14 at 17:43
Question asked Sep 27 '14 at 17:11 by Tim (edited Sep 27 '14 at 17:44)
2 Answers
Hmm... the concept of a file with several encodings is somewhat wobbly, to be honest. If you have a bit of time, this article (and this one) are worth reading.
For Linux, a file is a sequence of bytes. If you ask a program to interpret it as a text file, it does so using a mapping between bytes and characters; this mapping is the encoding. Almost all the text editors I know (not word processors!) understand only the concept of one encoding per file.
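To make the "mapping between bytes and characters" concrete, here is a small Python sketch (my own illustration, not part of the original answer). It decodes the same two bytes under three encodings; the byte values and the choice of encodings are just examples.

```python
# The same two bytes, read under three different byte-to-character mappings.
raw = bytes([0xC3, 0xA9])

as_utf8 = raw.decode("utf-8")      # one character: é
as_latin1 = raw.decode("latin-1")  # two characters: Ã©
as_cp1250 = raw.decode("cp1250")   # two characters: Ă©

for name, text in [("utf-8", as_utf8), ("latin-1", as_latin1), ("cp1250", as_cp1250)]:
    # Print code points rather than glyphs so the output is terminal-independent.
    print(name, [f"U+{ord(c):04X}" for c in text])
```

Nothing in the bytes themselves says which reading is the intended one, which is why an editor has to be told, or has to guess, a single encoding for the file.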
I am not an expert on gedit; maybe it is doing some magic like trying to autodetect the encoding line by line or text block by text block. If that is the case, you can try to do the same using enca (1):
while IFS= read -r line; do printf '%s\n' "$line" | enconv -L none -x utf8; done < text.mixed > text.utf8
...but it depends on how good enca is at guessing your encoding (it works fairly well with Eastern European encodings, but not with Latin1, for example).
(1) It's in the repos; just install it with sudo apt-get install enca.
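If enca guesses badly for your text, the same line-by-line idea can be sketched in Python without any detection library. This is a different technique from enca's statistical guessing: it tries strict UTF-8 first and only then falls back to a fixed list. The names `FALLBACKS`, `decode_line` and `convert_file` are made up for this example, the fallback list is an assumption you would need to adapt, and splitting on newline bytes assumes ASCII-compatible encodings (so not UTF-16).

```python
from pathlib import Path

# Hypothetical fallback list: adjust to the encodings you suspect in your file.
FALLBACKS = ("cp1252", "latin-1")

def decode_line(raw: bytes) -> str:
    """Decode one line: strict UTF-8 first, then each fallback in order."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    for enc in FALLBACKS:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # latin-1 never fails, so this is only reached if FALLBACKS is changed.
    return raw.decode("utf-8", errors="replace")

def convert_file(src: str, dst: str) -> None:
    """Rewrite src as UTF-8, deciding the source encoding line by line."""
    lines = Path(src).read_bytes().split(b"\n")  # any "\r" stays attached
    text = "\n".join(decode_line(line) for line in lines)
    Path(dst).write_bytes(text.encode("utf-8"))
```

Usage would be `convert_file("text.mixed", "text.utf8")`. Trying UTF-8 first is a reasonable default because non-ASCII text in a legacy 8-bit encoding is rarely valid UTF-8 by accident, but a line can still decode "successfully" under the wrong fallback, so check the result by eye.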
(1) But I may be wrong. I get the impression of mixed encodings from using tools such as chardet to detect the mixed encodings for the file, and from failing to convert all of the text file to another encoding by specifying only one original encoding. (2) Generally, if you copy texts from different sources possibly with different encodings to a gedit window, and save it to a file, will gedit save it with just one encoding? Which encoding does gedit use for saving? Does that involve conversion from the sources' encodings to the file's encoding?
– Tim
Sep 27 '14 at 18:09
I do not know whether the copy/paste protocol can send the encoding together with the data, or whether all copy sources send it. I fear not, but I could be wrong. So probably some wrong guessing is happening here. I know for sure that copy and paste between different encodings seldom works.
– Rmano
Sep 27 '14 at 22:10
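On the question of which parts of the file are the problem: a quick check, sketched here in Python as my own addition, is to list the lines that are not valid UTF-8. It does not tell you what encoding those lines are in, only where to look. The function name `non_utf8_lines` and the sample bytes are invented for the example.

```python
def non_utf8_lines(data: bytes):
    """Yield (line_number, raw_bytes) for each line that is not valid UTF-8."""
    for number, raw in enumerate(data.split(b"\n"), start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            yield number, raw

# Sample input: line 2 is UTF-8, line 3 holds a single cp1252/latin-1 byte.
sample = b"plain ascii\ncaf\xc3\xa9\nna\xefve\n"
for number, raw in non_utf8_lines(sample):
    print(number, raw)
```

For a real file you would pass `open("text.mixed", "rb").read()` instead of `sample`. If nothing is reported, the file is already consistent UTF-8 and the display problem lies elsewhere.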
I had the same problem and solved it with Emacs. The solution is quoted from here:
Another possible solution is to mark each region appearing with Chinese characters and recode it with M-x recode-region, giving "Text was really in" as utf-16-le and "But was interpreted as" as utf-16-be.
Another one is to split out the parts which have different encodings, paste them into different files, convert the encoding of one and add it to the other. In my case this worked with Atom, but not with Notepad++ (utf16-le/be).
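As I understand it, recoding a region amounts to turning the wrongly decoded text back into its original bytes and decoding those bytes with the right encoding. The Python sketch below illustrates that idea for the utf-16-le / utf-16-be case; it is not Emacs code, and the helper name `recode` is made up.

```python
def recode(text: str, really_in: str, interpreted_as: str) -> str:
    """Undo a wrong decoding: recover the original bytes, then decode them properly."""
    return text.encode(interpreted_as).decode(really_in)

original = "hi"
# Simulate the mistake: little-endian bytes read with the wrong byte order.
garbled = original.encode("utf-16-le").decode("utf-16-be")
fixed = recode(garbled, really_in="utf-16-le", interpreted_as="utf-16-be")

print([f"U+{ord(c):04X}" for c in garbled])  # code points land in the CJK range
print(fixed)
```

This also shows why the broken regions look like Chinese characters: swapping the two bytes of a Latin letter moves its code point into the CJK block.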
First answer (enca) answered Sep 27 '14 at 17:53 by Rmano (edited Sep 27 '14 at 17:59)
Second answer (Emacs) answered Jan 23 at 18:46 by giordano (edited 2 days ago)