Command line tool to search and replace text on a PDF












4














I have a rather long PDF file with my name as an obnoxious watermark throughout. My name does appear as text, so I tried replacing it with blanks in LibreOffice Draw, but the find-and-replace function tanks my computer, taking significant RAM and CPU time.



Is there a command line way to remove strings from PDF? Hmm... can sed do that?










      command-line libreoffice pdf














      edited Dec 26 at 3:05









      Pablo Bianchi











      asked Dec 14 at 21:45









      j0h























          2 Answers


















          7














          In many cases the watermark is just text, so you can often remove it with sed or in fact any text editor – let’s say it says “watermark”:



          sed 's/watermark//g' in.pdf >out.pdf


          This doesn’t work if your PDF file is compressed; in that case you need to uncompress it first, e.g. with pdftk (How can I install pdftk in Ubuntu 18.04 and later?), and then run sed on the uncompressed file:



          pdftk in.pdf output out.pdf uncompress 


          If sed’s output is not readable with your preferred PDF reader, try repairing it with pdftk:



          pdftk out.pdf output out_pdftk.pdf
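
          The three commands above can be chained with distinct file names, so that sed really runs on the uncompressed copy. This is only a sketch: the function name strip_plain_watermark and the temporary file names are made up for illustration, and it assumes the watermark is stored as the literal bytes "watermark", which the grep check verifies before anything is changed.

```shell
# Hypothetical helper, not part of pdftk: strip_plain_watermark IN.pdf OUT.pdf
strip_plain_watermark() {
    in=$1
    out=$2
    work=$(mktemp -d) || return 1

    # 1. Uncompress the page streams so that text operators become visible.
    pdftk "$in" output "$work/plain.pdf" uncompress || { rm -rf "$work"; return 1; }

    # 2. Check that the string exists as literal bytes (-a: treat binary as text).
    if ! LC_ALL=C grep -a -q 'watermark' "$work/plain.pdf"; then
        echo "no literal 'watermark' found, sed would change nothing" >&2
        rm -rf "$work"
        return 2
    fi

    # 3. Delete the string, then let pdftk rewrite (and recompress) the file.
    LC_ALL=C sed 's/watermark//g' "$work/plain.pdf" > "$work/edited.pdf"
    status=0
    pdftk "$work/edited.pdf" output "$out" compress || status=$?
    rm -rf "$work"
    return "$status"
}
```

          Called as strip_plain_watermark in.pdf out.pdf, it returns 2 without writing anything when the string is not present as plain text, which is a cheap way to find out whether the sed approach applies to a given file at all.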


          Further reading: How to Edit PDFs?



          Source: How to remove watermark from pdf using pdftk • Super User

























          • 1




            Sorry, your answer is as wrong as it could be. What appears to be ASCII text in the visual representation of its content in a PDF viewer, may be hex encoded inside the PDF source code, or its individual characters might be placed individually, with each having its own coordinate information sprinkled in between the individual characters... Hence your sed command will not succeed -- not even after uncompressing the PDF.
            – Kurt Pfeifle
            Dec 25 at 23:14










          • @KurtPfeifle It may be, but it may also be just ASCII text, PDF is far from being standardized enough to be sure. I changed the wording to make it sound less catholic, of course you’re right there are cases where sed fails here. You’re familiar with the topic, do you have a better answer for OP? I’d love to learn about it!
            – dessert
            Dec 25 at 23:44










          • PDF is standardized enough for me to be sure about what I wrote above and below. I'm quite familiar with this standard. The cases where the sed method will fail are the overwhelming majority. The OP's requirements cannot be fulfilled by a CLI tool.
            – Kurt Pfeifle
            Dec 26 at 0:09





















          3














          Accepted answer will work only in rare cases



          Sorry, but as general advice the answer given by @dessert is as wrong as it could be. It will not work for the general case of text replacement in PDFs (watermarks or not), and you'll have to be very lucky to encounter one of the very rare PDFs where it does work. (Moreover, watermarks inserted by LibreOffice are frequently converted into vector or pixel graphics, even if they look like text when printed or viewed on screen. I'll not discuss that case any further; below I deal only with real text content in a PDF.)



          Reasons



          The reasons are these:




          1. What appears to be ASCII text when the PDF is displayed in a viewer very likely is not ASCII text inside the PDF source code. Instead, it may be hex encoded.


          2. Additionally, an ASCII string's individual characters might be placed on the page in consecutive order, but they may just as easily be placed individually, each with its own coordinate information sprinkled in between the characters...


          3. Also, the hex encoding of the ASCII (and non-ASCII) character table (the "mapping") is not predictable, and it may change from font to font.



          Hence in all these cases your sed command will not succeed -- not even after uncompressing the PDF.
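
          To illustrate reason 2 with a made-up content-stream line (the line below is hypothetical, not taken from a real file): even when the characters are stored as readable literal strings, kerning numbers between the pieces mean the word never occurs as one contiguous run of bytes, so the sed pattern has nothing to match.

```shell
# Hypothetical TJ array: literal strings, split up by kerning adjustments.
kerned='[(W)29(at)-2(er)6(m)-1(ark)]TJ'

# The same kind of substitution as in the accepted answer.
result=$(printf '%s\n' "$kerned" | sed 's/Watermark//g')

# The line comes back byte for byte identical.
if [ "$result" = "$kerned" ]; then
    echo "sed changed nothing: $result"
fi
```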



          Example



          Here is an example of how the "string" Watermark can appear inside a PDF created with LibreOffice:



          56.8 726.989 Td /F2 16 Tf[<01>29<0203>-2<0405>6<06>-1<020507>]TJ


          I'll dissect for you what that means:




          • 56.8 726.989 Td: Td is an operator to move the text positioning on the page; 56.8 726.989 are the x-/y-coordinates to describe that exact position.


          • /F2 16 Tf: Tf is an operator to set a certain font as well as its size as the currently active one; in this case it is the font tagged elsewhere with the name /F2 and its size should be 16 pt.



          • [<01>29<0203>-2<0405>6<06>-1<020507>]TJ: TJ is an operator to show text while at the same time allowing for individual glyph positioning. According to the 'charmap' table specific to that PDF and the font used, the hex snippets enclosed in angle brackets mean the following:




            • <01>: this is the 'W'.


            • <0203>: this is the 'at'.


            • <0405>: this is the 'er'.


            • <06>: this is the 'm'.


            • <020507>: this is the 'ark'.



            The numbers in between these hex snippets (29, -2, 6 and -1) are correction values which determine the individual spacings of the different characters.
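
            The dissection above can be replayed in the shell. This is only a sketch: the function name decode_tj is made up, the code-to-letter table is copied by hand from the list above (so it is valid for this one font in this one PDF only), and it assumes one-byte character codes as in this example.

```shell
line='56.8 726.989 Td /F2 16 Tf[<01>29<0203>-2<0405>6<06>-1<020507>]TJ'

# A plain-text search finds nothing: the word is not in the line as ASCII.
hits=$(printf '%s\n' "$line" | grep -c 'Watermark' || true)
echo "plain-text matches: $hits"

decode_tj() {
    # Extract the <...> hex strings, strip the brackets, split into 2-digit codes.
    printf '%s\n' "$1" | grep -o '<[0-9A-Fa-f]*>' | tr -d '<>' | fold -w 2 |
    while read -r code; do
        # Mapping taken from the dissection above; other fonts will differ.
        case $code in
            01) printf 'W' ;;
            02) printf 'a' ;;
            03) printf 't' ;;
            04) printf 'e' ;;
            05) printf 'r' ;;
            06) printf 'm' ;;
            07) printf 'k' ;;
            *)  printf '?' ;;
        esac
    done
    echo
}

decode_tj "$line"
```

            The last call prints Watermark, but only because the table was filled in after reading this particular PDF. Nothing in the line itself tells sed which codes to look for.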




          Now you show me how you'd replace that "string" with something else using sed... Remember, when you deal with an arbitrary PDF you know neither the encoding nor the placement correction numbers in advance. You can only find out by opening its source code in an editor and analysing its content.



          Executive Summary



          No, there is no command line way to reliably remove unwanted strings from a PDF!



          You can only do this if...



          (a) ...you are a PDF expert who is skilled at reading PDF source code;



          (b) ...you are prepared to analyse the PDF file in question individually;



          (c) ...you use a text editor to modify its contents after uncompressing the PDF source code.



          WARNING: The answer currently marked as 'accepted' might have worked for the specific PDF of the OP. However, it will not work in the general case. Don't take the "recipe" it advertises for granted!














            edited Dec 25 at 23:39

























            answered Dec 14 at 21:58









            dessert









            • 1




              Sorry, your answer is as wrong as it could be. What appears to be ASCII text in the visual representation of its content in a PDF viewer, may be hex encoded inside the PDF source code, or its individual characters might be placed individually, with each having its own coordinate information sprinkled in between the individual characters... Hence your sed command will not succeed -- not even after uncompressing the PDF.
              – Kurt Pfeifle
              Dec 25 at 23:14










            • @KurtPfeifle It may be, but it may also be just ASCII text, PDF is far from being standardized enough to be sure. I changed the wording to make it sound less catholic, of course you’re right there are cases where sed fails here. You’re familiar with the topic, do you have a better answer for OP? I’d love to learn about it!
              – dessert
              Dec 25 at 23:44










            • PDF is standardized enough, for me to be sure about what I wrote above and below. I'm quite familiar with this standard. The cases where sed method will fail are the overwhelming majority. The OPs requirements cannot be fulfilled by a CLI tool.
              – Kurt Pfeifle
              Dec 26 at 0:09
















            • 1




              Sorry, your answer is as wrong as it could be. What appears to be ASCII text in the visual representation of its content in a PDF viewer, may be hex encoded inside the PDF source code, or its individual characters might be placed individually, with each having its own coordinate information sprinkled in between the individual characters... Hence your sed command will not succeed -- not even after uncompressing the PDF.
              – Kurt Pfeifle
              Dec 25 at 23:14










            • @KurtPfeifle It may be, but it may also be just ASCII text, PDF is far from being standardized enough to be sure. I changed the wording to make it sound less catholic, of course you’re right there are cases where sed fails here. You’re familiar with the topic, do you have a better answer for OP? I’d love to learn about it!
              – dessert
              Dec 25 at 23:44










            • PDF is standardized enough, for me to be sure about what I wrote above and below. I'm quite familiar with this standard. The cases where sed method will fail are the overwhelming majority. The OPs requirements cannot be fulfilled by a CLI tool.
              – Kurt Pfeifle
              Dec 26 at 0:09










            1




            1




            Sorry, your answer is as wrong as it could be. What appears to be ASCII text in the visual representation of its content in a PDF viewer, may be hex encoded inside the PDF source code, or its individual characters might be placed individually, with each having its own coordinate information sprinkled in between the individual characters... Hence your sed command will not succeed -- not even after uncompressing the PDF.
            – Kurt Pfeifle
            Dec 25 at 23:14




            Sorry, your answer is as wrong as it could be. What appears to be ASCII text in the visual representation of its content in a PDF viewer, may be hex encoded inside the PDF source code, or its individual characters might be placed individually, with each having its own coordinate information sprinkled in between the individual characters... Hence your sed command will not succeed -- not even after uncompressing the PDF.
            – Kurt Pfeifle
            Dec 25 at 23:14












            @KurtPfeifle It may be, but it may also be just ASCII text, PDF is far from being standardized enough to be sure. I changed the wording to make it sound less catholic, of course you’re right there are cases where sed fails here. You’re familiar with the topic, do you have a better answer for OP? I’d love to learn about it!
            – dessert
            Dec 25 at 23:44




            @KurtPfeifle It may be, but it may also be just ASCII text, PDF is far from being standardized enough to be sure. I changed the wording to make it sound less catholic, of course you’re right there are cases where sed fails here. You’re familiar with the topic, do you have a better answer for OP? I’d love to learn about it!
            – dessert
            Dec 25 at 23:44












            PDF is standardized enough, for me to be sure about what I wrote above and below. I'm quite familiar with this standard. The cases where sed method will fail are the overwhelming majority. The OPs requirements cannot be fulfilled by a CLI tool.
            – Kurt Pfeifle
            Dec 26 at 0:09






            PDF is standardized enough, for me to be sure about what I wrote above and below. I'm quite familiar with this standard. The cases where sed method will fail are the overwhelming majority. The OPs requirements cannot be fulfilled by a CLI tool.
            – Kurt Pfeifle
            Dec 26 at 0:09















            3














            Accepted answer will work only in rare cases



            Sorry, the answer given by @dessert is as wrong as it could be as a general advice. It will not work for the general case of text replacement in PDFs (watermarks or not), and you'll have to be very lucky for very rare cases of PDFs you encounter were it would work. (Moreover, watermarks inserted by LibreOffice frequently are converted into vector or pixel graphics, even if they appear like text when printed or viewed on screen.... but this case I'll not discuss any further -- below I deal only with real text contents in a PDF.)



            Reasons



            The reasons for this are these:




            1. What appears to be ASCII text in the visual representation of its content in a PDF viewer, very likely will not be ASCII text inside the PDF source code. Instead it may be hex encoded.


            2. Additionally, an ASCII string's individual characters might be placed on the page in a consecutive order, but they may easily be placed individually, with each having its own coordinate information sprinkled in between the individual characters...


            3. Also, the hex encoding of the ASCII (and non-ASCII) character table (the "mapping") will not be predictable, and it may change from font to font.



            Hence in all these cases your sed command will not succeed -- not even after uncompressing the PDF.



            Example



            Here is an example for the "string" Watermark, how it can appear inside a PDF created with LibreOffice:



            56.8 726.989 Td /F2 16 Tf[<01>29<0203>-2<0405>6<06>-1<020507>]TJ


            I'll dissect for you what that means:




            • 56.8 726.989 Td: Td is an operator to move the text positioning on the page; 56.8 726.989 are the x-/y-coordinates to describe that exact position.


            • /F2 16 Tf: Tf is an operator to set a certain font as well as its size as the currently active one; in this case it is the font tagged elsewhere with the name /F2 and its size should be 16 pt.



            • [<01>29<0203>-2<0405>6<06>-1<020507>]TJ: TJ is an operator to show text while at the same time allowing for individual glyph positioning. The meaning of the hex snippets enclosed by angle brackets are the following, according to the 'charmap' table specific for that PDF and the used font:




              • <01>: this is the 'W'.


              • <0203>: this is the 'at'.


              • <0405>: this is the 'er'.


              • <06>: this is the 'm'.


              • <020507>: this is the 'ark'.



              The numbers in between these hex snippets (29, -2, 6 and -1) are correction values which determine the individual spacings of the different characters.




            Now you show me how you'd replace that "string" by something else by using sed... Remember, you do not know the encoding in advance, nor the placement correction numbers, when you deal with an arbitrary PDF. You can only find out by opening its source code in an editor and analysing its content.



            Executive Summary



            No, there is no command line way to reliably remove unwanted strings from a PDF!



            You can only do this if...



            (a) ...you are a PDF expert who is skilled to read the PDF source code;



            (b) ...you are prepared to analyse the PDF file in question individually;



            (c) ...you use a text editor to modify its contents after uncompressing the PDF source code.



            WARNING: The answer currently marked as 'accepted' might have worked for the specific PDF of the OP. However, it will not work in the general case. Don't take the "recipe" it advertises for granted!






            share|improve this answer




























              3














              Accepted answer will work only in rare cases



              Sorry, the answer given by @dessert is as wrong as it could be as a general advice. It will not work for the general case of text replacement in PDFs (watermarks or not), and you'll have to be very lucky for very rare cases of PDFs you encounter were it would work. (Moreover, watermarks inserted by LibreOffice frequently are converted into vector or pixel graphics, even if they appear like text when printed or viewed on screen.... but this case I'll not discuss any further -- below I deal only with real text contents in a PDF.)



              Reasons



              The reasons for this are these:




              1. What appears to be ASCII text in the visual representation of its content in a PDF viewer, very likely will not be ASCII text inside the PDF source code. Instead it may be hex encoded.


              2. Additionally, an ASCII string's individual characters might be placed on the page in a consecutive order, but they may easily be placed individually, with each having its own coordinate information sprinkled in between the individual characters...


              3. Also, the hex encoding of the ASCII (and non-ASCII) character table (the "mapping") will not be predictable, and it may change from font to font.



              Hence in all these cases your sed command will not succeed -- not even after uncompressing the PDF.



              Example



              Here is an example for the "string" Watermark, how it can appear inside a PDF created with LibreOffice:



              56.8 726.989 Td /F2 16 Tf[<01>29<0203>-2<0405>6<06>-1<020507>]TJ


              I'll dissect for you what that means:




              • 56.8 726.989 Td: Td is an operator to move the text positioning on the page; 56.8 726.989 are the x-/y-coordinates to describe that exact position.


              • /F2 16 Tf: Tf is an operator to set a certain font as well as its size as the currently active one; in this case it is the font tagged elsewhere with the name /F2 and its size should be 16 pt.



              • [<01>29<0203>-2<0405>6<06>-1<020507>]TJ: TJ is an operator to show text while at the same time allowing for individual glyph positioning. The meaning of the hex snippets enclosed by angle brackets are the following, according to the 'charmap' table specific for that PDF and the used font:




                • <01>: this is the 'W'.


                • <0203>: this is the 'at'.


                • <0405>: this is the 'er'.


                • <06>: this is the 'm'.


                • <020507>: this is the 'ark'.



                The numbers in between these hex snippets (29, -2, 6 and -1) are correction values which determine the individual spacings of the different characters.




              Now you show me how you'd replace that "string" by something else by using sed... Remember, you do not know the encoding in advance, nor the placement correction numbers, when you deal with an arbitrary PDF. You can only find out by opening its source code in an editor and analysing its content.



              Executive Summary



              No, there is no command line way to reliably remove unwanted strings from a PDF!



              You can only do this if...



              (a) ...you are a PDF expert who is skilled to read the PDF source code;



              (b) ...you are prepared to analyse the PDF file in question individually;



              (c) ...you use a text editor to modify its contents after uncompressing the PDF source code.



              WARNING: The answer currently marked as 'accepted' might have worked for the specific PDF of the OP. However, it will not work in the general case. Don't take the "recipe" it advertises for granted!






              share|improve this answer


























                3












                3








                3






                Accepted answer will work only in rare cases



                Sorry, the answer given by @dessert is as wrong as it could be as a general advice. It will not work for the general case of text replacement in PDFs (watermarks or not), and you'll have to be very lucky for very rare cases of PDFs you encounter were it would work. (Moreover, watermarks inserted by LibreOffice frequently are converted into vector or pixel graphics, even if they appear like text when printed or viewed on screen.... but this case I'll not discuss any further -- below I deal only with real text contents in a PDF.)



                Reasons



                The reasons for this are these:




                1. What appears to be ASCII text in the visual representation of its content in a PDF viewer, very likely will not be ASCII text inside the PDF source code. Instead it may be hex encoded.


                2. Additionally, an ASCII string's individual characters might be placed on the page in a consecutive order, but they may easily be placed individually, with each having its own coordinate information sprinkled in between the individual characters...


                3. Also, the hex encoding of the ASCII (and non-ASCII) character table (the "mapping") will not be predictable, and it may change from font to font.



                Hence in all these cases your sed command will not succeed -- not even after uncompressing the PDF.



                Example



                Here is an example for the "string" Watermark, how it can appear inside a PDF created with LibreOffice:



                56.8 726.989 Td /F2 16 Tf[<01>29<0203>-2<0405>6<06>-1<020507>]TJ


                I'll dissect for you what that means:




                • 56.8 726.989 Td: Td is an operator to move the text positioning on the page; 56.8 726.989 are the x-/y-coordinates to describe that exact position.


                • /F2 16 Tf: Tf is an operator to set a certain font as well as its size as the currently active one; in this case it is the font tagged elsewhere with the name /F2 and its size should be 16 pt.



                • [<01>29<0203>-2<0405>6<06>-1<020507>]TJ: TJ is an operator to show text while at the same time allowing for individual glyph positioning. The meaning of the hex snippets enclosed by angle brackets are the following, according to the 'charmap' table specific for that PDF and the used font:




                  • <01>: this is the 'W'.


                  • <0203>: this is the 'at'.


                  • <0405>: this is the 'er'.


                  • <06>: this is the 'm'.


                  • <020507>: this is the 'ark'.



                  The numbers in between these hex snippets (29, -2, 6 and -1) are correction values, given in thousandths of a unit of text space, which determine the individual spacings of the different characters.
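To make the dissection above concrete, here is a minimal Python sketch that decodes the TJ array from the example. The `CHARMAP` dictionary and the `decode_tj` helper are my own illustration, not part of any PDF library: the mapping is simply typed in from the list above. In a real PDF you would first have to dig it out of the font's encoding data, and it would look different for every file and font.

```python
import re

# Hypothetical glyph-code-to-character table, copied by hand from the
# example above. A real PDF defines this per font, so it is NOT reusable.
CHARMAP = {0x01: "W", 0x02: "a", 0x03: "t", 0x04: "e",
           0x05: "r", 0x06: "m", 0x07: "k"}


def decode_tj(tj_source, charmap, bytes_per_code=1):
    """Decode the hex strings of a TJ array, ignoring the spacing numbers."""
    chars = []
    for hex_string in re.findall(r"<([0-9A-Fa-f]+)>", tj_source):
        raw = bytes.fromhex(hex_string)
        # Split each hex string into glyph codes of the assumed width.
        for i in range(0, len(raw), bytes_per_code):
            code = int.from_bytes(raw[i:i + bytes_per_code], "big")
            chars.append(charmap.get(code, "?"))
    return "".join(chars)


line = "56.8 726.989 Td /F2 16 Tf[<01>29<0203>-2<0405>6<06>-1<020507>]TJ"
decoded = decode_tj(line, CHARMAP)
print(decoded)
```

The sketch assumes one byte per glyph code, which fits this example; other fonts may use two bytes per code, which is one more thing you cannot know without inspecting the file.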




                Now you show me how you'd replace that "string" with something else using sed... Remember, when you deal with an arbitrary PDF you know neither the encoding nor the placement correction numbers in advance. You can only find out by opening its source code in an editor and analysing its content.
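As a rough illustration of the point, the following Python sketch applies the equivalent of a literal `sed 's/Watermark//g'` to the example line, and then a pattern hand-written for this one file. The variable names are mine, and the tailored pattern is only valid for this exact encoding and these exact spacing numbers, which is precisely the problem.

```python
import re

content_stream = (
    "56.8 726.989 Td /F2 16 Tf"
    "[<01>29<0203>-2<0405>6<06>-1<020507>]TJ"
)

# Equivalent of a literal search and replace: the word never occurs as
# plain ASCII in the stream, so nothing is changed.
after_literal = content_stream.replace("Watermark", "")
literal_changed = after_literal != content_stream

# A pattern written by hand AFTER analysing this specific file. It removes
# the show-text operation, but matches no other PDF (or font) by design.
tailored = re.compile(r"\[<01>29<0203>-2<0405>6<06>-1<020507>\]TJ")
after_tailored = tailored.sub("", content_stream)

print(literal_changed)
print(after_tailored)
```

The literal replacement leaves the stream untouched, while the tailored one only works because the encoding was looked up first.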



                Executive Summary



                No, there is no command line way to reliably remove unwanted strings from a PDF!



                You can only do this if...



                (a) ...you are a PDF expert who is skilled at reading PDF source code;



                (b) ...you are prepared to analyse the PDF file in question individually;



                (c) ...you use a text editor to modify its contents after uncompressing the PDF source code.



                WARNING: The answer currently marked as 'accepted' might have worked for the specific PDF of the OP. However, it will not work in the general case. Don't take the "recipe" it advertises for granted!














                edited Dec 26 at 0:12

























                answered Dec 26 at 0:07









                Kurt Pfeifle






























