How to detect pages in PDF files that are actually two pages?












3















I have a PDF (it does not contain scanned images), each page of which is actually a two-page spread, like this:



(screenshot: a single PDF page that contains a two-page spread)



However, there are some normal pages, so when I wrote a program to convert the file to normal pages, I had to scroll through the file, identify the exception pages, and write them to a list so that the program knows specifically which pages not to cut in half (I used mutool for the cutting; it works for this type of file).
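For reference, the cutting command I mean looks roughly like this (a sketch with placeholder file names, not my exact script; mutool poster splits each page into an x-by-y grid, so -x 2 halves every page vertically):

import os

# cut every page of input.pdf vertically in half, writing the result to output.pdf
os.system("mutool poster -x 2 input.pdf output.pdf")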



So how can I detect which pages are normal and which are not? Please help me, thank you very much.










pdf
















asked Mar 29 at 12:28









Dang Manh Truong














  • You can try to play with pdfinfo from the poppler-utils package (or other utilities from it).

    – N0rbert
    Mar 29 at 12:49













  • @N0rbert Thank you very much. I found that pdfinfo did not help much, but pdftohtml helped a lot, because I can use the HTML converted from the PDF as input to a regex, which lets me detect the single pages. It is admittedly not optimal, but it is acceptable for me.

    – Dang Manh Truong
    Mar 30 at 9:29
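To illustrate the pdfinfo suggestion from the comments: pdfinfo can print each page's size, so if the spread pages had been physically wider than the normal ones, a check like the sketch below would have flagged them (the page count and file name are taken from the answer's script; as noted above, this route did not help much for this particular file):

import re
import subprocess

# pdfinfo -f/-l prints one line per page, e.g. "Page    1 size: 612 x 792 pts"
out = subprocess.check_output(
    ["pdfinfo", "-f", "1", "-l", "395", "Lenin06.pdf"]).decode("utf-8", "replace")
for m in re.finditer(r"Page\s+(\d+)\s+size:\s+([\d.]+) x ([\d.]+) pts", out):
    page, width, height = int(m.group(1)), float(m.group(2)), float(m.group(3))
    if width > height:  # wider than tall: possibly a two-page spread
        print(page)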





























1 Answer
























2














After playing with several utilities from the poppler-utils package, I finally arrived at an acceptable, but not optimal, solution.



It turns out that detecting double pages in PDF files is a rather tricky business. I was unable to find any library that can do so easily. So in the end, I decided to use pdftohtml, a tool from the poppler-utils package, to convert each page into HTML, and then use regular expressions to pick out the pages that are not double pages. Interestingly, I was able to get most cases right using just one or two lines of the HTML file. It does not work in all cases, as there are double pages that get classified as single pages, but there seems to be no single page that gets classified as a double page, so there is no risk of damaging the original file.
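The per-page conversion at the heart of this is just the following (a minimal sketch; pdftohtml writes the page content to <name>s.html, and the 30-line skip matches my file, so it may differ for others):

import os

page = 5  # hypothetical page number
os.system("pdftohtml -i Lenin06.pdf -f %d -l %d" % (page, page))  # -f/-l: convert only this page
with open("Lenin06s.html", "rt") as fid:
    for _ in range(30):  # skip the lines that are identical on every page
        fid.readline()
    line = fid.readline().strip()  # the first content line, tested by the regexes below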



Here is what I did: I mostly depended on detecting the header number, which in almost all cases is the first line of the HTML file (after several initial lines that are the same across all pages).



I used the fact that in the introduction of the file, the header numbers use Roman numerals, so I used the corresponding regex:



if (re.findall('<a name=[0-9]*></a>[XIVLCDM]*<br/>', line) or
        re.findall('<a name=[0-9]*></a>[XIVLCDM]*&#[0-9]*;<br/>', line) or
        re.findall('<a name=[0-9]*></a>&#[0-9]*;[XIVLCDM]*&#[0-9]*;<br/>', line) or
        re.findall('<a name=[0-9]*></a>&#[0-9]*;[XIVLCDM]*<br/>', line)):


Another thing I noticed is that if the line (actually the 31st line, since the first 30 lines are the same across all the pages) contains an image link, then the page likely does not need to be cut in half (there are cases where the left page is blank and the right page contains an image, but these are few and far between, so I just iterate through the pages in the result and remove those that are double pages). I simply search for the string "img".
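For illustration, that check is just a substring search on the content line (the sample line here is a hypothetical pdftohtml output line, not taken from my file):

import re

line = '<a name=5></a><img src="Lenin06-005_1.jpg"/><br/>'  # hypothetical example
if re.findall("img", line):  # a plain "img" in line would work just as well
    print("page contains an image; treat it as a single page")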



I also found out that double pages contain the header number right at the beginning, so I simply used:



if (re.findall('<a name=[0-9]*></a>&#[0-9]*;[0-9]*<br/>', line) or
        re.findall('<a name=[0-9]*></a>[0-9]*<br/>', line) or
        re.findall('<a name=[0-9]*></a>[0-9]*&#[0-9]*;<br/>', line) or
        re.findall('<a name=[0-9]*></a>&#[0-9]*;[0-9]*&#[0-9]*;<br/>', line) or
        re.findall('<a name=[0-9]*></a>V.  I.  L ª - n i n &#[0-9]*;<br/>', line)):


(the last pattern handles a few special pages that need separate treatment)



In the end, it does not detect all single pages, but the good thing is that it never wrongly classifies a single page as a double page. So if the result is, say, [1, 5, 100], I can simply iterate through the list and check each case visually. Although this is still not completely automated, it is much, much better than having to check every single page.
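Since the list still needs a visual pass, one way to speed that up is to render just the flagged pages with pdftoppm (another poppler-utils tool); a sketch, using the [1, 5, 100] example above:

import os

excps = ["1", "5", "100"]  # example detection result
for p in excps:
    # render page p to a PNG (pdftoppm appends the page number to the prefix)
    os.system("pdftoppm -f %s -l %s -png Lenin06.pdf review_%s" % (p, p, p))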



For those interested, here is my code (in Python 2.7):



#!/usr/bin/python
# -*- coding: utf-8 -*-
#
# Find pages that are not double pages.
# OS: Ubuntu
# Requirements: Python 2.7, pdftohtml (poppler-utils)

import errno
import os
import pdb
import re


def silentremove(filename):
    """Remove a file, ignoring the error if it does not exist."""
    try:
        os.remove(filename)
    except OSError as e:  # this would be "except OSError, e:" before Python 2.6
        if e.errno != errno.ENOENT:  # errno.ENOENT = no such file or directory
            raise  # re-raise the exception if a different error occurred


def cleanup(base):
    """Remove the three HTML files that pdftohtml generates per run."""
    silentremove(base + ".html")
    silentremove(base + "_ind.html")
    silentremove(base + "s.html")


num_of_pages = 395
input_pdf = "Lenin06.pdf"
base = input_pdf[:-4]
excps = []  # pages that should NOT be cut in half

for i in range(1, num_of_pages + 1):
    if i in (1, 2):  # the first two pages always go to the exception list
        excps.append(str(i))
        continue
    if i in (3, 4):  # pages 3 and 4 are known double pages, skip them
        continue
    os.system("pdftohtml -i %s -f %d -l %d" % (input_pdf, i, i))  # convert page i only
    with open(base + "s.html", "rt") as html_fid:
        for _ in range(30):  # the first 30 lines are identical on every page
            html_fid.readline()
        line = html_fid.readline().strip()  # the 31st line: first real content

        if re.findall("img", line):
            # page contains an image: treat it as a single page
            excps.append(str(i))
        elif re.findall('<a name=[0-9]*></a>&#[0-9]*;<br/>', line):
            excps.append(str(i))
        elif (re.findall('<a name=[0-9]*></a>[XIVLCDM]*<br/>', line) or
              re.findall('<a name=[0-9]*></a>[XIVLCDM]*&#[0-9]*;<br/>', line) or
              re.findall('<a name=[0-9]*></a>&#[0-9]*;[XIVLCDM]*&#[0-9]*;<br/>', line) or
              re.findall('<a name=[0-9]*></a>&#[0-9]*;[XIVLCDM]*<br/>', line)):
            pass  # Loi tua (Introduction): Roman page number, so a double page
        elif (re.findall('<a name=[0-9]*></a>&#[0-9]*;[0-9]*<br/>', line) or
              re.findall('<a name=[0-9]*></a>[0-9]*<br/>', line) or
              re.findall('<a name=[0-9]*></a>[0-9]*&#[0-9]*;<br/>', line) or
              re.findall('<a name=[0-9]*></a>&#[0-9]*;[0-9]*&#[0-9]*;<br/>', line) or
              re.findall('<a name=[0-9]*></a>V.  I.  L ª - n i n &#[0-9]*;<br/>', line)):
            pass  # Trang doi (double page): header number right at the beginning
        elif (re.findall('<a name=[0-9]*></a>[^0-9&#;]* <br/>', line) and
              re.findall('^[0-9]*&#[0-9]*;<br/>$', html_fid.readline().strip())):
            # 1 so truong hop trang trai trong, trang phai co chu
            # (some cases where the left page is blank while the right page contains text)
            pass
        else:
            excps.append(str(i))
    cleanup(base)

# remove the images that pdftohtml extracted along the way
for name in os.listdir("."):
    if name.endswith((".png", ".jpg", ".jpeg")):
        silentremove(name)

pdb.set_trace()  # drop into the debugger to inspect excps


And this is the file: https://drive.google.com/open?id=1vjnebt3xEuY8odhZHPwL8pf26l8ySdnE (this is just one example; I have many more that need to be converted to single pages).






        answered Mar 30 at 9:27









Dang Manh Truong






























