How to detect pages in PDF files that are actually two-pages?
I have a PDF (it does not contain scanned images) in which each page is actually a two-page spread, i.e. two book pages side by side on one PDF page.
However, some pages are normal single pages. When I wrote a program to convert the file to single pages, I had to scroll through the whole file, identify the exceptional pages by eye, and write them into a list so that the program knew which pages not to cut in half (I used mutool for the cutting; it works for this type of file).
So how can I detect which pages are normal and which are not? Please help me, thank you very much.
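The cutting itself is not the problem: mutool's poster subcommand can split every page of a file into a grid, so something along these lines halves each page vertically (shown for illustration; check your mutool version's usage):

    mutool poster -x 2 input.pdf output.pdf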
asked Mar 29 at 12:28 by Dang Manh Truong
You can try to play with pdfinfo from the poppler-utils package (or other utils from it).
– N0rbert, Mar 29 at 12:49
@N0rbert thank you very much. I found that pdfinfo did not help much, but pdftohtml helped a lot, because I can use the HTML converted from the PDF as input to the regexes, and this lets me detect the single pages. It is admittedly not optimal, but it is acceptable for me.
– Dang Manh Truong, Mar 30 at 9:29
1 Answer
After playing with several utilities from the poppler-utils package, I finally arrived at an acceptable, but not optimal, solution.
It turns out that detecting double pages in PDF files is rather tricky. I was unable to find any library that can do it easily. So in the end I decided to use pdftohtml, a tool from the poppler-utils package, to convert each page into HTML, and then use regular expressions to pick out the pages that are not double pages. Interestingly, I was able to classify most pages correctly using just one or two lines of the HTML file. It does not work in every case, as some double pages get marked as single pages, but it seems that no single page gets marked as a double page, so there is no risk of damaging the original file.
Here is what I did. I mostly relied on detecting the header (page) number, which in almost all cases is on the first line of the HTML file (after the initial lines, which are identical across all pages).
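I run pdftohtml on one page at a time, e.g. for page 12; with this input name the extracted text ends up in Lenin06s.html (pdftohtml also writes Lenin06.html and Lenin06_ind.html, which the script below deletes afterwards):

    pdftohtml -i Lenin06.pdf -f 12 -l 12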
In the introduction of the book, the header number uses Roman numerals, so I matched it with the corresponding regexes:
if (re.findall('<a name=[0-9]*></a>[XIVLCDM]*<br/>', line) or
        re.findall('<a name=[0-9]*></a>[XIVLCDM]*&#[0-9]*;<br/>', line) or
        re.findall('<a name=[0-9]*></a>&#[0-9]*;[XIVLCDM]*&#[0-9]*;<br/>', line) or
        re.findall('<a name=[0-9]*></a>&#[0-9]*;[XIVLCDM]*<br/>', line)):
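For instance, a line like the following (a made-up example of pdftohtml output) is caught by the first pattern:

    <a name=3></a>VII<br/>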
Another thing I noticed: if the line (actually the 31st line, since the first 30 lines are identical across all pages) contains an image link, then the page most likely does not need to be cut in half. (There are cases where the left half is blank and the right half contains an image, but these are few and far between, so I just iterate through the resulting pages and remove the ones that are double pages.) For this I simply search for the string "img".
I also found that double pages have the header number right at the beginning, so I simply used:
if (re.findall('<a name=[0-9]*></a>&#[0-9]*;[0-9]*<br/>', line) or
        re.findall('<a name=[0-9]*></a>[0-9]*<br/>', line) or
        re.findall('<a name=[0-9]*></a>[0-9]*&#[0-9]*;<br/>', line) or
        re.findall('<a name=[0-9]*></a>&#[0-9]*;[0-9]*&#[0-9]*;<br/>', line) or
        re.findall('<a name=[0-9]*></a>V. I. L ª - n i n &#[0-9]*;<br/>', line)):
(The last pattern is there because of a few special pages that need special treatment.)
In the end it does not detect all single pages, but the good thing is that it never wrongly classifies a single page as a double page. So if the result is, say, [1, 5, 100], I can simply go through the list and check each page visually. This is still not completely automated, but it is much, much better than having to check every single page.
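Once the detection script below has produced the exception list, the cutting itself can go roughly like this. This is a sketch only: pdfseparate comes from poppler-utils, mutool from mupdf-tools, and the file names and the example page list are illustrative:

    #!/bin/sh
    EXCEPTIONS=" 1 5 100 "               # pages that must NOT be cut, space-padded
    pdfseparate Lenin06.pdf page-%d.pdf  # one PDF file per page
    for f in page-*.pdf; do
        n=${f#page-}; n=${n%.pdf}
        case "$EXCEPTIONS" in
            *" $n "*) ;;                                            # single page: keep as-is
            *) mutool poster -x 2 "$f" tmp.pdf && mv tmp.pdf "$f" ;; # double page: cut in half
        esac
    done
    mutool merge -o Lenin06-single.pdf $(ls -v page-*.pdf)          # reassemble in order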
For those interested, here is my code (Python 2.7):
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Find pages that are not double pages
# OS: Ubuntu
# Requirements: Python 2.7, pdftohtml (from poppler-utils)
import errno
import os
import pdb
import re

def silentremove(filename):
    try:
        os.remove(filename)
    except OSError as e:  # this would be "except OSError, e:" before Python 2.6
        if e.errno != errno.ENOENT:  # errno.ENOENT = no such file or directory
            raise  # re-raise the exception if a different error occurred

def cleanup(basename):
    # pdftohtml leaves three files behind for every page; remove them all
    silentremove(basename + ".html")
    silentremove(basename + "_ind.html")
    silentremove(basename + "s.html")

num_of_pages = 395
input = "Lenin06.pdf"
basename = input[:-4]
excps = []  # pages that should NOT be cut in half
i = 1
while i <= num_of_pages:
    if i in (1, 2):  # the first two pages are known single pages
        excps.append(str(i))
        i += 1
        continue
    if i in (3, 4):  # pages 3 and 4 are known double pages
        i += 1
        continue
    # Convert just page i to HTML; the text lands in <basename>s.html
    os.system("pdftohtml -i %s -f %d -l %d" % (input, i, i))
    html_file = basename + "s.html"
    with open(html_file, 'rt') as html_fid:
        for j in range(30):  # the first 30 lines are identical on every page
            html_fid.readline()
        line = html_fid.readline().strip()
        if re.findall("img", line):
            # Pages containing an image link are almost always single pages
            excps.append(str(i))
        elif re.findall('<a name=[0-9]*></a>&#[0-9]*;<br/>', line):
            excps.append(str(i))
        elif (re.findall('<a name=[0-9]*></a>[XIVLCDM]*<br/>', line) or
              re.findall('<a name=[0-9]*></a>[XIVLCDM]*&#[0-9]*;<br/>', line) or
              re.findall('<a name=[0-9]*></a>&#[0-9]*;[XIVLCDM]*&#[0-9]*;<br/>', line) or
              re.findall('<a name=[0-9]*></a>&#[0-9]*;[XIVLCDM]*<br/>', line)):
            pass  # Loi tua (Introduction): Roman page number, so a double page
        elif (re.findall('<a name=[0-9]*></a>&#[0-9]*;[0-9]*<br/>', line) or
              re.findall('<a name=[0-9]*></a>[0-9]*<br/>', line) or
              re.findall('<a name=[0-9]*></a>[0-9]*&#[0-9]*;<br/>', line) or
              re.findall('<a name=[0-9]*></a>&#[0-9]*;[0-9]*&#[0-9]*;<br/>', line) or
              re.findall('<a name=[0-9]*></a>V. I. L ª - n i n &#[0-9]*;<br/>', line)):
            pass  # Trang doi (Double page)
        elif (re.findall('<a name=[0-9]*></a>[^0-9&#;]* <br/>', line) and
              re.findall('^[0-9]*&#[0-9]*;<br/>$', html_fid.readline().strip())):
            # 1 so truong hop trang trai trong, trang phai co chu
            # (Some cases where the left page is blank while the right
            # page contains text)
            pass
        else:
            excps.append(str(i))
    cleanup(basename)
    i += 1

# pdftohtml also dumps extracted images into the working directory
for name in os.listdir("./"):
    if name.endswith(".png") or name.endswith(".jpg") or name.endswith(".jpeg"):
        silentremove(name)

pdb.set_trace()  # inspect excps here
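Run it from the directory containing the PDF; when it stops at pdb.set_trace(), the list excps holds (as strings) the numbers of the pages that should not be cut in half.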
And this is the file: https://drive.google.com/open?id=1vjnebt3xEuY8odhZHPwL8pf26l8ySdnE (this is just an example; I have many more that need to be converted to single pages).
answered Mar 30 at 9:27 by Dang Manh Truong