Word frequencies from large body of scraped text












I have a file with word frequency logs from a very messy corpus of scraped Polish text that I am trying to clean to get accurate word frequencies. Since this is a big text file, I divided it into batches.



Here is a snippet from the original file:




 1 środka(byłe
1 środka.było
1 środkacccxli.
1 (środkach)
1 „środkach”
1 środ­kach
1 środkach...
1 środkach.",
1 środkach"
1 środkach".
1 środkachwzorem
1 środkach.życie
1 środkajak
1 "środkami"
1 (środkami)
1 „środkami”)
1 środkami!"
1 środkami”
1 środkami)?
1 środkami˝.



My goal is to clean true word labels and remove noisy word labels (e.g. collocations of words concatenated through punctuation). This is what is achieved by the first part of the script. As you can see in the data sample above, several noisy entries belong to the same true label. Once cleaned, their frequencies should be added. This is what I try to achieve in the second part of my script.
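The two steps described above can be sketched in a few lines. The cleaning rule used here (strip surrounding punctuation, delete soft hyphens, reject labels that still contain non-letters) is my guess at a reasonable policy for the sample data, not the author's actual script:

```python
from collections import Counter

# Punctuation that may surround an otherwise valid word in the sample.
SURROUNDING = '.,!?()"„”“\'˝…'

def clean_label(raw):
    label = raw.replace('\xad', '')       # drop soft hyphens: 'środ\xadkach' -> 'środkach'
    label = label.strip(SURROUNDING)      # strip quotes/punctuation at either end
    # Reject labels with internal non-letters, e.g. fused words like 'środka.było'
    return label if label.isalpha() else None

# Toy rows in (freq, raw_label) form, mirroring the snippet above.
rows = [(1, '„środkach”'), (1, 'środkach...'), (1, 'środ\xadkach'), (1, 'środka.było')]

freqs = Counter()
for freq, raw in rows:
    word = clean_label(raw)
    if word is not None:
        freqs[word] += freq
# freqs == Counter({'środkach': 3})
```

Entries that collapse to the same clean label have their frequencies summed, and irrecoverable ones are dropped.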



Here is the code in one piece with fixed indentation, in case you are able to reproduce my issues on your end:



# -*- coding: utf-8 -*-

import io
import pandas as pd
import numpy as np

num_batches = 54

for i in range(1 ,num_batches +1):

    infile_path = r'input_batch_' + str(i) + r'.txt'
    outfile_path = r'output_batch_' + str(i) + r'.txt'

    with io.open(infile_path, 'r', encoding = 'utf8') as infile, \
         io.open(outfile_path, 'w', encoding='utf8') as outfile:

        entries_raw = infile.readlines()
        entries_single = [x.strip() for x in entries_raw]
        entries = [x.split('\t') for x in entries_single]

        data = pd.DataFrame({"word": [], "freq": []})

        for j in range(len(entries)):
            data.loc[j] = entries[j][1], entries[j][0]

        freq_dict = dict()
        keys = np.unique(data['word'])

        for key in keys:
            for x in range(len(data)):
                if data['word'][x] == key:
                    if key in freq_dict:
                        prior_freq = freq_dict.get(key)
                        freq_dict[key] = prior_freq + data['freq'][x]
                    else:
                        freq_dict[key] = data['freq'][x]

        for key in freq_dict.keys():
            outfile.write("%s,%s\n" % (key, freq_dict[key]))


The problem with this code is that it is either buggy (running into an infinite loop or something similar) or very slow, even when processing a single batch, to the point of being impractical. Are there ways to streamline this code to make it computationally tractable? In particular, can I achieve the same goal without using for loops? Or by using a different data structure for word-frequency lookup than a dictionary?










python performance dictionary lookup






edited 1 hour ago
asked 2 hours ago

Des Grieux
285

  • 1

    I've added the fixed code in one piece below. Thank you!
    – Des Grieux
    2 hours ago
2 Answers






2














Reinderien covered most of the other issues with your code. But you should know there's a built-in class for simplifying the task of tallying word frequencies:



from collections import Counter

yourListOfWords = [...]

frequencyOfEachWord = Counter(yourListOfWords)
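In this question the input is (frequency, word) pairs rather than a raw token stream, so instead of building a list of words first, the parsed counts can be added into a Counter directly; missing keys default to 0. A small sketch with toy data (not the real batch files):

```python
from collections import Counter

# Toy (count, word) pairs as they would look after splitting each line on '\t'.
pairs = [(1, 'środkach'), (3, 'środkach'), (2, 'środkami')]

freq = Counter()
for count, word in pairs:
    freq[word] += count   # missing keys default to 0, so no 'if key in' check needed
# freq == Counter({'środkach': 4, 'środkami': 2})
```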





answered 47 mins ago
AleksandrH
19919
    1














    for i in range(1 ,num_batches +1):


    Your inter-token spacing here is a little wonky. I suggest running this code through a linter to get it to be PEP8-compliant.



    This string:



    r'input_batch_' + str(i) + r'.txt'


    can be:



    f'input_batch_{i}.txt'


    This code:



    entries_raw = infile.readlines()
    entries_single = [x.strip() for x in entries_raw]
    entries = [x.split('\t') for x in entries_single]


    can also be simplified, to:



    entries = [line.rstrip().split('\t') for line in infile]


    Note a few things. You don't need to call readlines(); you can treat the file object itself as an iterator. Also, avoid calling a variable x even if it's an intermediate variable; you need meaningful names.



    This is an antipattern inherited from C:



    for j in range(len(entries)):
        data.loc[j] = entries[j][1], entries[j][0]


    You should instead do:



    for j, entry in enumerate(entries):
        data.loc[j] = entry[1], entry[0]


    That also applies to your for x in range(len(data)):.



    This:



    freq_dict = dict()


    should be:



    freq_dict = {}


    This:



    if key in freq_dict:
        prior_freq = freq_dict.get(key)
        freq_dict[key] = prior_freq + data['freq'][x]
    else:
        freq_dict[key] = data['freq'][x]


    can be simplified to:



    prior_freq = freq_dict.get(key)
    freq_dict[key] = data['freq'][x]
    if prior_freq is not None:
        freq_dict[key] += prior_freq


    Note a few things. First of all, you were inappropriately using get - either check for key presence and then use [] indexing, or use get and then check the return value (which is preferred, as it requires fewer key lookups).



    This loop:



    for key in freq_dict.keys():
        outfile.write("%s,%s\n" % (key, freq_dict[key]))


    needs adjustment in a few ways. Firstly, it won't run at all because its indentation is wrong. Also, rather than only iterating over keys, you should iterate over items:



    for key, freq in freq_dict.items():
        outfile.write(f'{key},{freq}\n')
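As for the "without for loops" part of the question: the row-by-row `data.loc[j]` writes and the nested scan over `keys` are what make the script quadratic. pandas can parse and aggregate a whole batch in two vectorized calls. A minimal sketch, assuming each line is `freq<TAB>word` (the function and column names here are mine):

```python
import io
import pandas as pd

def aggregate_batch(infile, outfile):
    # Parse the whole tab-separated batch at once instead of per-row .loc writes.
    data = pd.read_csv(infile, sep='\t', header=None, names=['freq', 'word'])
    # One grouped sum replaces the nested loops over keys and rows.
    totals = data.groupby('word', sort=False)['freq'].sum()
    # Writes one 'word,total' line per distinct word.
    totals.to_csv(outfile, header=False)

# Works with file paths or file-like objects:
inbuf = io.StringIO('1\tśrodkach\n1\tśrodkach\n2\tśrodkami\n')
outbuf = io.StringIO()
aggregate_batch(inbuf, outbuf)
```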





answered 1 hour ago
Reinderien
2,226617





























