How to convert a text file with mixture of encodings to a single encoding?

I created a text file by copying its different parts from different
sources (webpages, other text files, pdf files) into gedit and
saving it to the file. I guess that is the reason that I have
multiple encodings in the text file, but I am not sure. How can I
avoid creating a text file with mixed encodings by copying its
different parts from different sources into gedit?

Whenever I open the file in gedit, gedit can always show or decode
every part of the text correctly. It seems that gedit can handle a
text file with mixed encodings, but I am not sure.

But when I open the file in emacs, there will be characters that
can't be shown correctly. (I am not sure why emacs can't do that.)
So I would like to convert the file from mixed encodings to a single
encoding such as utf-8.

Since I think gedit can detect the correct encodings for different
parts of the text file, and I don't know if there are other
applications that can do so, would it be possible to ask gedit to
convert the file to utf-8, or at least tell me what encoding it finds
for which part of the file?

Thanks.

edited Sep 27 '14 at 17:44

asked Sep 27 '14 at 17:11

Tim


When you click File > Save As, you should see two options at the bottom of the window: one for character encoding and a second for line endings.

– jeremija
Sep 27 '14 at 17:37

Is that the encoding which gedit used for opening the text file?

– Tim
Sep 27 '14 at 17:39

Most probably it is.

– jeremija
Sep 27 '14 at 17:41

Is that also the encoding which gedit guessed for the text file?

– Tim
Sep 27 '14 at 17:42

I guess so. When you open a file you can also choose an encoding to use, or you can let it auto detect the encoding.

– jeremija
Sep 27 '14 at 17:43

gedit emacs encoding

2 Answers

Hmmm... the concept of a file with several encodings is somewhat wobbly, to be honest. If you have a bit of time, this article (and this one) are worth reading.

For Linux, a file is a sequence of bytes. If you ask a program to interpret it as a text file, it will do so using a mapping between bytes and characters; this mapping is the encoding. Almost all the text editors I know (not word processors!) understand only the concept of one encoding per file.
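To make the "mapping between bytes and characters" concrete, here is a small sketch that decodes the same two bytes under two different encodings. It uses `iconv` rather than anything gedit-specific, and the sample bytes are my own choice for illustration.

```shell
#!/bin/sh
# The two bytes 0xC3 0xA9 (octal 303 251) are one character in UTF-8
# but two characters when the same bytes are read as ISO-8859-1 (Latin-1).
bytes=$(printf '\303\251')

# Decode as UTF-8: a single "e with acute accent".
as_utf8=$(printf '%s' "$bytes" | iconv -f UTF-8 -t UTF-8)

# Decode as Latin-1 and re-encode as UTF-8 so a UTF-8 terminal can show it:
# the result is the familiar two-character mojibake.
as_latin1=$(printf '%s' "$bytes" | iconv -f ISO-8859-1 -t UTF-8)

printf 'read as UTF-8:   %s\n' "$as_utf8"
printf 'read as Latin-1: %s\n' "$as_latin1"
```

The bytes on disk never change; only the mapping used to read them does, which is why a wrong guess shows up as garbage characters rather than as an error.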

I am no expert on gedit; maybe it is doing some magic like trying to autodetect the encoding line by line or text block by text block... If that is the case, you can try to do the same using enca(1):

 while IFS= read -r line; do printf '%s\n' "$line" | enconv -L none -x utf8; done < text.mixed > text.utf8

...but it depends on how good enca is at guessing your encoding (it works fairly well with Eastern European encodings, but not with Latin1, for example).

(1) It's in the repos, just install it with sudo apt-get install enca.
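If guessing is unreliable for your data, a variant of the same line-by-line idea is to skip detection and only test each line for valid UTF-8, falling back to one known legacy encoding otherwise. This is only a sketch under a strong assumption: that the file mixes UTF-8 with exactly one other single-byte encoding (ISO-8859-1 here, which is my guess and not something the question confirms). The function name `to_utf8` is made up for illustration, and it relies on `iconv`, not `enca`.

```shell
#!/bin/bash
# to_utf8 [FALLBACK]
# Reads lines on stdin. Lines that are already valid UTF-8 pass through
# unchanged; any other line is assumed to be in FALLBACK (default
# ISO-8859-1, an assumption) and is converted to UTF-8.
to_utf8() {
    local fallback=${1:-ISO-8859-1} line
    # `|| [ -n "$line" ]` also handles a last line without a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        if printf '%s' "$line" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
            printf '%s\n' "$line"      # valid UTF-8, keep as is
        else
            printf '%s\n' "$line" | iconv -f "$fallback" -t UTF-8
        fi
    done
}

# Usage: to_utf8 < text.mixed > text.utf8
```

This cannot work for parts in UTF-16, because the shell drops the NUL bytes those encodings contain, and a legacy-encoded line that happens to be valid UTF-8 will be passed through untouched.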

edited Sep 27 '14 at 17:59

answered Sep 27 '14 at 17:53

Rmano


(1) But I may be wrong. I get the impression of mixed encodings from using tools such as chardet to detect the mixed encodings for the file, and from failing to convert all of the text file to another encoding by specifying only one original encoding. (2) Generally, if you copy texts from different sources possibly with different encodings to a gedit window, and save it to a file, will gedit save it with just one encoding? Which encoding does gedit use for saving? Does that involve conversion from the sources' encodings to the file's encoding?

– Tim
Sep 27 '14 at 18:09

I do not know whether the copy/paste protocol can send the encoding together with the data, or whether all copy sources send it. I fear not, but I could be wrong. So probably something is guessing wrong here. I know for sure that copy and paste between different encodings seldom works.

– Rmano
Sep 27 '14 at 22:10


I had the same problem and solved it with Emacs. The solution is quoted from here:

Another possible solution is to mark each region appearing with Chinese characters and recode it with M-x recode-region, giving "Text was really in" as utf-16-le and "But was interpreted as" as utf-16-be.
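For anyone wondering why the text shows up as Chinese characters and why recoding repairs it, the effect can be reproduced outside Emacs. The sketch below uses `iconv` and a sample string of my own; it is not how Emacs implements `recode-region`, it only mirrors the same two steps (encode with the wrong interpretation, decode with the real one).

```shell
#!/bin/sh
# "Hi" in UTF-16LE is the bytes 48 00 69 00. Read as UTF-16BE, those
# bytes are U+4800 and U+6900, which are both CJK ideographs.
garbled=$(printf 'Hi' | iconv -f UTF-8 -t UTF-16LE | iconv -f UTF-16BE -t UTF-8)

# Undo it: turn the characters back into bytes using the encoding they
# were wrongly interpreted as (UTF-16BE), then decode those bytes with
# the encoding the text was really in (UTF-16LE).
fixed=$(printf '%s' "$garbled" | iconv -f UTF-8 -t UTF-16BE | iconv -f UTF-16LE -t UTF-8)

printf 'garbled: %s\n' "$garbled"
printf 'fixed:   %s\n' "$fixed"
```

The round trip is lossless here because swapping the byte order of a plain two-byte UTF-16 code unit still gives a decodable value in most cases; text containing surrogate pairs may not survive it.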

Another one is to split out the parts that have different encodings, paste them into separate files, convert the encoding of one and append it to the other. In my case this worked with Atom, but not with Notepad++ (utf16-le/be).
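The split-and-convert approach can also be done on the command line if you know the byte offset where the encoding changes. The following is a rough sketch, assuming the file has exactly two parts and the first is already UTF-8; the function name `convert_tail`, the file names and the offset are all hypothetical.

```shell
#!/bin/sh
# convert_tail FILE OFFSET ENCODING
# Writes the first OFFSET bytes of FILE unchanged (assumed to be UTF-8
# already), followed by the remaining bytes converted from ENCODING to UTF-8.
convert_tail() {
    head -c "$2" "$1"
    # tail -c +N starts output at byte N (1-based), hence OFFSET + 1.
    tail -c "+$(( $2 + 1 ))" "$1" | iconv -f "$3" -t UTF-8
}

# Usage (hypothetical names and offset):
#   convert_tail mixed.txt 120 UTF-16LE > fixed.txt
```

Finding the offset is the hard part; a hex dump of the file, for example with `od -c`, usually makes the switch to UTF-16 visible as alternating NUL bytes.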

edited 2 days ago

answered Jan 23 at 18:46

giordano
