How to convert a text file with a mixture of encodings to a single encoding?












3
















  1. I created a text file by copying parts of it from different
    sources (web pages, other text files, PDF files) into gedit and
    saving the result. I suspect that is why the file contains
    multiple encodings, but I am not sure. How can I avoid creating
    a text file with mixed encodings when pasting text from
    different sources into gedit?


  2. Whenever I open the file in gedit, every part of the text is
    displayed (decoded) correctly. It seems that gedit can handle a
    text file with mixed encodings, but I am not sure.

    But when I open the file in Emacs, some characters are not
    displayed correctly. (I am not sure why Emacs can't do that.)
    So I would like to convert the file from mixed encodings to a single
    encoding such as UTF-8.

    Since I think gedit can detect the correct encoding for each part of the text file, and I don't know whether other applications can do so, is it possible to ask gedit to convert the file to
    UTF-8, or at least to tell me which encoding it found for which part of the file?




Thanks.



































  • When you click File > Save As, you should see two options at the bottom of the window: one for the character encoding and one for line endings.

    – jeremija
    Sep 27 '14 at 17:37











  • Is that the encoding which gedit used for opening the text file?

    – Tim
    Sep 27 '14 at 17:39













  • Most probably it is.

    – jeremija
    Sep 27 '14 at 17:41











  • Is that also the encoding which gedit guessed for the text file?

    – Tim
    Sep 27 '14 at 17:42











  • I guess so. When you open a file you can also choose which encoding to use, or you can let gedit auto-detect the encoding.

    – jeremija
    Sep 27 '14 at 17:43























gedit emacs encoding














asked Sep 27 '14 at 17:11 by Tim (edited Sep 27 '14 at 17:44)























2 Answers


















2














Hmmm... the concept of a file with several encodings is somewhat wobbly, to be honest. If you have a bit of time, this article (and this one) are worth reading.



For Linux, a file is a sequence of bytes. If you ask a program to interpret it as a text file, it does so using a mapping between bytes and characters; this mapping is the encoding. Almost all the text editors I know (not word processors!) understand only the concept of one encoding per file.
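To make the byte-to-character mapping concrete, here is a minimal sketch. It assumes `iconv` and `od` are installed (neither is mentioned in the answer above). The single byte 0xE9 is "é" when read as Latin-1 (ISO-8859-1), but it is not a complete character when read as UTF-8:

```shell
#!/bin/sh
# One byte, 0xE9 (octal 351): this is "é" in ISO-8859-1.
# Decode it as ISO-8859-1, re-encode it as UTF-8, and dump the resulting bytes.
latin1_as_utf8=$(printf '\351' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')
echo "0xE9 read as Latin-1, written as UTF-8: $latin1_as_utf8"

# The same byte on its own is not a complete UTF-8 sequence, so iconv rejects it.
if printf '\351' | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
    utf8_ok=yes
else
    utf8_ok=no
fi
echo "0xE9 valid as UTF-8 on its own: $utf8_ok"
```

The bytes never change; only the mapping chosen by the reader does. That is why an editor that assumes one encoding for the whole file shows garbage for any part that was written with another.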



I am no expert on gedit; maybe it does some magic, like trying to autodetect the encoding line by line or text block by text block... If that is the case, you can try to do the same using enca (1):



 while IFS= read -r line; do printf '%s\n' "$line" | enconv -L none -x utf8; done < text.mixed > text.utf8


...but it depends on how good enca is at guessing your encoding (it works fairly well with Eastern European encodings, but not with Latin-1, for example).



(1) It's in the repos; just install it with sudo apt-get install enca.
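If enca cannot guess the encoding, a cruder variant of the same line-by-line idea is to keep every line that is already valid UTF-8 and decode the rest with one assumed legacy encoding. The sketch below uses `iconv` instead of enca. The `FALLBACK` variable, the `to_utf8` function and the temporary demo files are all made up for illustration, and the approach only makes sense for ASCII-compatible encodings (it will not work for UTF-16 text):

```shell
#!/bin/sh
# Sketch: convert a file to UTF-8 line by line, treating lines as raw bytes.
LC_ALL=C; export LC_ALL
FALLBACK=${FALLBACK:-ISO-8859-1}   # assumed legacy encoding; adjust to your data

to_utf8() {
    n=0
    # "|| [ -n "$line" ]" also handles a last line without a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        n=$((n + 1))
        if printf '%s\n' "$line" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
            # Already valid UTF-8: pass the line through unchanged.
            printf '%s\n' "$line"
        else
            # Not valid UTF-8: report the line and decode it with the fallback.
            echo "line $n: not valid UTF-8, decoding as $FALLBACK" >&2
            printf '%s\n' "$line" | iconv -f "$FALLBACK" -t UTF-8
        fi
    done
}

# Demo input: line 1 is UTF-8 ("é" = C3 A9), line 2 is Latin-1 ("é" = E9).
tmpdir=$(mktemp -d)
printf 'caf\303\251\ncaf\351\n' > "$tmpdir/text.mixed"
to_utf8 < "$tmpdir/text.mixed" > "$tmpdir/text.utf8"
od -An -tx1 < "$tmpdir/text.utf8"
```

The messages on standard error also give a rough answer to "which part is in which encoding". This is a heuristic only: a line in a legacy encoding can happen to be valid UTF-8, and the fallback is a guess that you have to supply yourself.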
































  • (1) But I may be wrong. I got the impression of mixed encodings from using tools such as chardet to detect the file's encoding, and from failing to convert the whole text file to another encoding when specifying only one original encoding. (2) Generally, if you copy text from different sources, possibly with different encodings, into a gedit window and save it to a file, will gedit save it with just one encoding? Which encoding does gedit use for saving? Does that involve conversion from the sources' encodings to the file's encoding?

    – Tim
    Sep 27 '14 at 18:09













  • I do not know whether the copy/paste protocol can send the encoding together with the data, or whether all copy sources send it. I fear not, but I could be wrong. So probably some wrong guessing is happening here. I know for sure that copy and paste between different encodings seldom works.

    – Rmano
    Sep 27 '14 at 22:10



















1














I had the same problem and solved it with Emacs. The solution is quoted from here:




Another possible solution is to mark each region appearing with Chinese characters and recode it with M-x recode-region, giving "Text was really in" as utf-16-le and "But was interpreted as" as utf-16-be.




Another one is to split out the parts that have different encodings, paste them into separate files, convert the encoding of one and append it to the other. In my case this worked with Atom, but not with Notepad++ (utf16-le/be).
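As an illustration of why UTF-16 text read with the wrong byte order shows up as Chinese characters, and of what recoding a region amounts to, here is a small sketch using `iconv` (an assumption; the answer itself uses Emacs). It is meant as an analogy to recode-region, not a description of how Emacs implements it:

```shell
#!/bin/sh
# "A" in UTF-16LE is the two bytes 41 00.
# Read with the wrong byte order (UTF-16BE) they become U+4100, a CJK ideograph.
wrong=$(printf 'A\000' | iconv -f UTF-16BE -t UTF-8)
wrong_hex=$(printf '%s' "$wrong" | od -An -tx1 | tr -d ' \n')
echo "misread as UTF-16BE: $wrong (UTF-8 bytes: $wrong_hex)"

# Undo it: re-encode with the encoding that was wrongly assumed (UTF-16BE),
# then decode with the encoding the text was really in (UTF-16LE).
fixed=$(printf '%s' "$wrong" | iconv -f UTF-8 -t UTF-16BE | iconv -f UTF-16LE -t UTF-8)
echo "recoded: $fixed"
```

The two encodings in the second pipeline correspond to the "But was interpreted as" and "Text was really in" prompts, in that order.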



























































    First answer (enca): answered Sep 27 '14 at 17:53 by Rmano (edited Sep 27 '14 at 17:59)





















        Second answer (Emacs recode-region): answered Jan 23 at 18:46 by giordano (edited 2 days ago)




































