BASH Script hangs after some processing on Ubuntu
I have been running the script below on a Red Hat server, where it works fine and finishes the job. The file I feed it contains about half a million lines (approximately 500,000), so to finish faster I added an '&' at the end of the while loop block.
Now I have set up a desktop with 8 GB of RAM running Ubuntu 18.04, and the same code only gets through a few thousand lines before it hangs. I read a bit about it and increased the stack limit to unlimited, but it still hung after 80,000 lines or so. Any suggestions on how I can optimize the code or tune my PC parameters so that the job always finishes?
while read -r CID60
do
{
OLT=$(echo "$CID60" | cut -d"|" -f5)
ONID=${OLT}:$(echo "$CID60" | cut -d, -f2 | sed 's/ //g ; s/).*|//')
echo $ONID,$(echo "$CID60" | cut -d"|" -f3) >> $localpath/CID_$logfile.csv
} &
done < $localpath/$CID7360
Input:
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ASSN45| Unlocked|12-654-0330|Up|202-00_MSRFKH00OL6|P282018767.C2028 ( network, R1.S1.LT7.PON8.ONT81.SERV1 )|
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ASSN46| Unlocked|12-654-0330|Down|202-00_MSRFKH00OL6|P282017856.C881 ( local, R1.S1.LT7.PON8.ONT81.C1.P1 )|
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ASSN52| Unlocked|12-664-1186|Up|202-00_MSRFKH00OL6|P282012623.C2028 ( network, R1.S1.LT7.PON8.ONT75.SERV1 )|
output:
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ONT81.SERV1,12-654-0330
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ONT81.C1.P1,12-654-0330
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ONT75.SERV1,12-664-1186
The output I am interested in is the 5th column (separated by the pipe character |), concatenated with part of the last column, followed by the third column.
bash text-processing background-process
that's an awful lot of processes to fire off at, more or less, the same time. You might want to wait after some number of lines, or investigate other strategies to parallelize the job (such as GNU parallel)
– glenn jackman
2 days ago
@PerlDuck I have added the input and output of the script. of course it won't run as it is since some of the variables are defined out of this code. Also I am thinking to try sed or awk to do this job, it might be a lot quicker but I need to learn how to write such expression....
– Ibraheem
yesterday
@glennjackman I have been reading about parallel, can you suggest some way how I can use it in a loop like this one above?
– Ibraheem
yesterday
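The batching idea from the first comment (calling wait after some number of lines) could look roughly like the sketch below. It is only an illustration under assumptions: process_line, max_jobs and the temporary files are hypothetical names, the three sample lines from the question stand in for the real input, and the field splitting uses the shell's own read and parameter expansion instead of the original echo/cut/sed pipeline.

```shell
#!/bin/sh
# Sketch only: cap concurrency by waiting after every batch of background jobs.
# process_line, max_jobs and the temp files are illustrative names, not from the question.

input=$(mktemp)
output=$(mktemp)

# Three sample lines from the question stand in for the real input file.
cat > "$input" <<'EOF'
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ASSN45| Unlocked|12-654-0330|Up|202-00_MSRFKH00OL6|P282018767.C2028 ( network, R1.S1.LT7.PON8.ONT81.SERV1 )|
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ASSN46| Unlocked|12-654-0330|Down|202-00_MSRFKH00OL6|P282017856.C881 ( local, R1.S1.LT7.PON8.ONT81.C1.P1 )|
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ASSN52| Unlocked|12-664-1186|Up|202-00_MSRFKH00OL6|P282012623.C2028 ( network, R1.S1.LT7.PON8.ONT75.SERV1 )|
EOF

process_line() {
    # Split on '|' with the shell's own read, so no cut/sed processes are spawned.
    IFS='|' read -r _f1 _f2 cid _f4 olt last _rest <<EOF
$1
EOF
    ont=${last#*, }   # drop everything up to and including ", "
    ont=${ont%% *}    # drop the trailing " )"
    printf '%s:%s,%s\n' "$olt" "$ont" "$cid"
}

max_jobs=2
count=0
while IFS= read -r line; do
    process_line "$line" >> "$output" &
    count=$((count + 1))
    # After every max_jobs lines, block until the whole batch has finished.
    if [ $((count % max_jobs)) -eq 0 ]; then
        wait
    fi
done < "$input"
wait   # collect the final, possibly partial, batch

# Background jobs finish in no guaranteed order, so sort before inspecting.
LC_ALL=C sort "$output"
```

Even with batching, one process per line remains expensive for half a million lines, so a single-pass tool such as sed, awk or perl is likely to be the better fit here.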
Your code seems amenable to a single sed instruction operating on the input file that would run thousands of times faster. awk would also be a solution.
– xenoid
yesterday
@xenoid can you please suggest some sed expression?
– Ibraheem
yesterday
edited yesterday by GAD3R
asked 2 days ago by Ibraheem
4 Answers
Perl solution
This script doesn't do anything in parallel but is quite fast regardless.
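Save it as filter.pl (or whatever name you prefer) and make it executable.
#!/usr/bin/env perl
use strict;
use warnings;
# Read stdin (or files named on the command line) line by line.
while( <> ) {
    # $1 = 3rd field, $2 = 5th field, $3 = first token after the comma in the 6th field
    if ( /^(?:[^|]+\|){2}([^|]+)\|[^|]+\|([^|]+)\|[^,]+,\s*(\S+)/ ) {
        print "$2:$3,$1\n";
    }
}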
I copied your sample data until I got 1,572,864 lines and then ran it as follows:
me@ubuntu:~> time ./filter.pl < input.txt > output.txt
real 0m3,603s
user 0m3,487s
sys 0m0,100s
me@ubuntu:~> tail -3 output.txt
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ONT81.SERV1,12-654-0330
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ONT81.C1.P1,12-654-0330
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ONT75.SERV1,12-664-1186
If you prefer one-liners, do:
perl -lne 'print "$2:$3,$1" if /^(?:[^|]+\|){2}([^|]+)\|[^|]+\|([^|]+)\|[^,]+,\s*(\S+)/;' < input.txt > output.txt
Indeed, this perl solution has been the fastest; it took less than a second to process 300K lines. I am preparing some other lookup-like scripts and will be looking forward to further help. Thanks everyone, all answers were helpful, but @perlduck's solution was the fastest. And since my original while loop wasn't producing results in order, the order doesn't matter for me anyway.
– Ibraheem
yesterday
@Ibraheem, Yes this perl solution is very good, probably with a great margin fast enough for your purpose. -- But try my tr and cut solution, which is actually faster in my computer (and I think easier to understand and modify), and wait for a solution with parallel and perl by PerlDuck, which I think can be the fastest of them all.
– sudodus
yesterday
@sudodus I tried your solution, it was really fast (took about 0.205 seconds), but the columns are not coming as I want them and it has a pipe in the middle,
– Ibraheem
yesterday
@Ibraheem, Is it important to have the format that you want (order of columns and separators between the columns)? The reason why my solution is fast is that it does as little as possible, still showing what you need (but in a different order). If you prefer another separator, it is possible: space ' ' would cost no extra time, another separator would cost some extra time for a tr or tr -s command, but not very much.
– sudodus
yesterday
I finally made a oneliner with awk, which is on par with the perl oneliner (slightly faster in my computer), maybe easier to understand and edit, if you would need that in the future. The outputs of these two oneliners are exactly the same for the test case. See the end of my answer. Any of the two solutions should be good for you.
– sudodus
yesterday
A pure sed solution:
sed -r 's/^[^|]+\|[^|]+\|([^|]+)\|[^|]+\|([^|]+)\|.+\( .+, ([^ ]+).+/\2:\3,\1/' <in.dat >out.dat
+1: Nice with a pure sed solution :-) But my cut and sed solution is faster ;-)
– sudodus
yesterday
Yes I know. But mine produces the result in the requested order 🤨🤨
– xenoid
yesterday
That's right, we will see how important it is to get exactly what the OP prescribes. By the way, I think you drop one character, MSRFKH00OL6 --> MSRFKH00OL in your output. I think you can fix that with a minor edit.
– sudodus
yesterday
@sudodus Yes, transcription error. Fixed :)
– xenoid
yesterday
I timed your new one-liner and it works well, actually slightly faster than before. I don't know if there was something else happening in my computer, anyway, I edited my answer to show the new result :-)
– sudodus
yesterday
Oneliner
If the order of the items and the separators can be different from what you specify in the question, I thought the following one-liner would do it,
< input tr ' ' '|' | cut -d '|' -f 4,6,10 > output
but in a comment you wrote that you need exactly the specified format.
I added a solution with awk, which is approximately on par with PerlDuck's solution with perl. See the end of this answer.
< input awk '{gsub(/\|/," "); print $5 ":" $9 "," $3}' > output
Test
The test was done in my computer with Lubuntu 18.04.1 LTS, 2*2 processors and 4 GiB RAM.
I made a huge infile by 'doubling 20 times' from your demo input (1572864 lines), so there is some margin to your 500000 lines.
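A test file like this can be built by repeatedly concatenating the file with itself. The sketch below is one way to do it, assuming the starting point is the three sample lines from the question; the answer does not show the exact starting file or commands. The function name double_file and the temporary files are hypothetical. Starting from three lines, 19 doublings give 3 × 2^19 = 1,572,864 lines; the example uses only 4 doublings to stay small.

```shell
#!/bin/sh
# Sketch: grow a test file by repeatedly concatenating it with itself.
# double_file and the temp file names are illustrative, not from the answer.

double_file() {
    # $1 = file to grow in place, $2 = number of doublings
    file=$1
    rounds=$2
    tmp=$(mktemp)
    i=0
    while [ "$i" -lt "$rounds" ]; do
        # Write two copies to a scratch file, then replace the original.
        cat "$file" "$file" > "$tmp" && mv "$tmp" "$file"
        i=$((i + 1))
    done
    rm -f "$tmp"
}

infile=$(mktemp)
cat > "$infile" <<'EOF'
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ASSN45| Unlocked|12-654-0330|Up|202-00_MSRFKH00OL6|P282018767.C2028 ( network, R1.S1.LT7.PON8.ONT81.SERV1 )|
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ASSN46| Unlocked|12-654-0330|Down|202-00_MSRFKH00OL6|P282017856.C881 ( local, R1.S1.LT7.PON8.ONT81.C1.P1 )|
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ASSN52| Unlocked|12-664-1186|Up|202-00_MSRFKH00OL6|P282012623.C2028 ( network, R1.S1.LT7.PON8.ONT75.SERV1 )|
EOF

# 3 lines doubled 4 times; pass 19 instead of 4 for the full-size benchmark file.
double_file "$infile" 4
lines=$(wc -l < "$infile" | tr -d ' ')
echo "$lines lines"
```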
Oneliner with cut and sed:
$ < infile cut -d '|' -f 3,5,6 | sed -e 's/|[A-Z].*, /|/' -e 's/ )$//' > outfile
$ wc -l infile
1572864 infile
$ wc -l outfile
1572864 outfile
$ tail outfile
12-664-1186|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT75.SERV1
12-654-0330|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT81.SERV1
12-654-0330|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT81.C1.P1
12-664-1186|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT75.SERV1
12-654-0330|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT81.SERV1
12-654-0330|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT81.C1.P1
12-664-1186|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT75.SERV1
12-654-0330|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT81.SERV1
12-654-0330|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT81.C1.P1
12-664-1186|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT75.SERV1
Timing
We might expect that a pure sed solution would be faster, but I think that reordering of the data slows it down, so that the cut and sed solution is faster. Both solutions work without any problem in my computer.
Oneliner with cut and sed:
$ time < infile cut -d '|' -f 3,5,6 | sed -e 's/|[A-Z].*, /|/' -e 's/ )$//' > outfile
real 0m8,132s
user 0m8,633s
sys 0m0,617s
A pure sed oneliner by xenoid:
$ time sed -r 's/^[^|]+\|[^|]+\|([^|]+)\|[^|]+\|([^|]+)\|.+\( .+, ([^ ]+).+/\2:\3,\1/' <infile > outfile-sed
real 1m8,686s
user 1m8,259s
sys 0m0,344s
A perl oneliner by PerlDuck is faster than the previous oneliners:
$ time perl -lne 'print "$2:$3,$1" if /^(?:[^|]+\|){2}([^|]+)\|[^|]+\|([^|]+)\|[^,]+,\s*(\S+)/;' < infile > outfile.perl
real 0m5,929s
user 0m5,339s
sys 0m0,256s
Oneliner with tr and cut with a tr -s command:
I used tr to convert the spaces in the input file to pipe characters and then cut could do it all without sed. As you can see, tr is much faster than sed. The tr -s command squeezes repeated pipes in the input, which is a good idea, particularly if there can be repeated spaces or pipes in the input file. It does not cost much.
$ time < infile tr ' ' '|' | tr -s '|' '|' | cut -d '|' -f 3,5,9 > outfile-tr-cut
real 0m1,277s
user 0m1,781s
sys 0m0,925s
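The field numbers differ between the two tr variants because of the space after the first pipe in each line. The sketch below runs both pipelines on one sample line from the question to show this; the variable names are hypothetical and nothing here is a benchmark.

```shell
#!/bin/sh
# Sketch: why the squeezed variant uses fields 3,5,9 and the plain one uses 4,6,10.
line='202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ASSN45| Unlocked|12-654-0330|Up|202-00_MSRFKH00OL6|P282018767.C2028 ( network, R1.S1.LT7.PON8.ONT81.SERV1 )|'

# Without squeezing: "| Unlocked" becomes "||Unlocked", which adds an empty field 2.
plain=$(printf '%s\n' "$line" | tr ' ' '|' | cut -d '|' -f 4,6,10)

# With squeezing: runs of '|' collapse to one, so the later field numbers drop by one.
squeezed=$(printf '%s\n' "$line" | tr ' ' '|' | tr -s '|' '|' | cut -d '|' -f 3,5,9)

printf '%s\n%s\n' "$plain" "$squeezed"
```

Both variables should hold the same three values. If the real input can contain empty fields or runs of spaces, the squeezed variant changes which field numbers apply, so it is worth checking a few real lines before relying on either form.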
Oneliner with tr and cut without the tr -s command, fastest so far:
$ time < infile tr ' ' '|' | cut -d '|' -f 4,6,10 > outfile-tr-cut
real 0m1,199s
user 0m1,020s
sys 0m0,618s
$ tail outfile-tr-cut
12-664-1186|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT75.SERV1
12-654-0330|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT81.SERV1
12-654-0330|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT81.C1.P1
12-664-1186|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT75.SERV1
12-654-0330|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT81.SERV1
12-654-0330|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT81.C1.P1
12-664-1186|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT75.SERV1
12-654-0330|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT81.SERV1
12-654-0330|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT81.C1.P1
12-664-1186|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT75.SERV1
Oneliner with awk, fast but not the fastest:
< input awk '{gsub(/\|/," "); print $5 ":" $9 "," $3}' > output
$ time < infile awk '{gsub(/\|/," "); print $5 ":" $9 "," $3}' > outfile.awk
real 0m5,091s
user 0m4,724s
sys 0m0,365s
Speed summary: the 'real' time according to time, rounded to 1 decimal:
1m 8.7s - sed
8.1s - cut & sed
5.9s - perl
5.1s - awk
1.2s - tr & cut
Finally, I note that the oneliners with sed, perl and awk create an output file with the prescribed format.
$ tail outfile.awk
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ONT75.SERV1,12-664-1186
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ONT81.SERV1,12-654-0330
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ONT81.C1.P1,12-654-0330
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ONT75.SERV1,12-664-1186
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ONT81.SERV1,12-654-0330
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ONT81.C1.P1,12-654-0330
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ONT75.SERV1,12-664-1186
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ONT81.SERV1,12-654-0330
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ONT81.C1.P1,12-654-0330
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ONT75.SERV1,12-664-1186
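One way to confirm that two of the one-liners really produce identical output is to run them on the same small sample and compare the results. The sketch below does this for the awk and sed variants, under a few assumptions: the sample file is a hypothetical temporary file holding the three lines from the question, sed is called with -E instead of -r (both select extended regular expressions in GNU sed), and the awk pattern is written as a regex literal.

```shell
#!/bin/sh
# Sketch: check that the awk and sed one-liners agree on the sample input.
sample=$(mktemp)
cat > "$sample" <<'EOF'
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ASSN45| Unlocked|12-654-0330|Up|202-00_MSRFKH00OL6|P282018767.C2028 ( network, R1.S1.LT7.PON8.ONT81.SERV1 )|
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ASSN46| Unlocked|12-654-0330|Down|202-00_MSRFKH00OL6|P282017856.C881 ( local, R1.S1.LT7.PON8.ONT81.C1.P1 )|
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ASSN52| Unlocked|12-664-1186|Up|202-00_MSRFKH00OL6|P282012623.C2028 ( network, R1.S1.LT7.PON8.ONT75.SERV1 )|
EOF

# awk: turn pipes into spaces, then pick whitespace-separated fields 5, 9 and 3.
awk_out=$(awk '{gsub(/\|/, " "); print $5 ":" $9 "," $3}' < "$sample")

# sed: capture fields 3 and 5 plus the identifier after ", " and reorder them.
sed_out=$(sed -E 's/^[^|]+\|[^|]+\|([^|]+)\|[^|]+\|([^|]+)\|.+\( .+, ([^ ]+).+/\2:\3,\1/' < "$sample")

if [ "$awk_out" = "$sed_out" ]; then
    echo "outputs match"
else
    echo "outputs differ"
fi
rm -f "$sample"
```

For the full-size files, comparing the output files directly with cmp would avoid holding everything in shell variables.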
Nice :-) Try perl -lne 'print "$2:$3,$1" if /^(?:[^|]+\|){2}([^|]+)\|[^|]+\|([^|]+)\|[^,]+,\s*(\S+)/;' < input.txt > output.txt. I also repeated the input until I got 1,572,864 lines and it runs in ~3.5 seconds on my machine.
– PerlDuck
yesterday
Note that the desired output is not | separated but uses : and ,.
– PerlDuck
yesterday
@PerlDuck, 1. Yes, I will time your perl expression :-) 2. I know (and wrote about it in the beginning of my answer), that it is not exactly what the OP wants, but similar enough to be useful, and, I think, faster than if rearranged to the exact specification.
– sudodus
yesterday
Sorry, I missed your introductory sentence about the different separators. Btw., this is an interesting approach using GNU parallel ;-)
– PerlDuck
yesterday
@PerlDuck, If you make an answer with your fast perl oneliner, I will upvote it :-)
– sudodus
yesterday
Python
import sys,re
# Capture the 3rd field, the 5th field and the identifier after ", " in the 6th field.
pattern=re.compile(r'^.+\|.+\|(.+)\|.+\|(.+)\|.+, (.+) \)\|$')
for line in sys.stdin:
    match=pattern.match(line)
    if match:
        print(match.group(2)+':'+match.group(3)+','+match.group(1))
(works with both Python2 and Python3)
Using a regex with non-greedy matches is 4x faster (avoids backtracking?) and puts python on par with the cut/sed method (python2 being a bit faster than python3)
import sys,re
# Same captures, but with negated character classes and non-greedy quantifiers.
pattern=re.compile(r'^[^|]+?\|[^|]+?\|([^|]+?)\|[^|]+?\|([^|]+?)\|[^,]+?, (.+) \)\|$')
for line in sys.stdin:
    match=pattern.match(line)
    if match:
        print(match.group(2)+':'+match.group(3)+','+match.group(1))
This one also works fine as expected, but it is a bit slower than the perl one.
– Ibraheem
14 hours ago
Perl solution answered yesterday by PerlDuck
1
Indeed this perl solution has been fastest, took about less than a second to process 300K lines, I am preparing some other lookups like scripts, I will be looking forward to further help, thanks everyone, all were helpful, but @perlduck's solution was fastest, and as my original while loop wasn't producing results in order, so the order won't matter for me anyway
– Ibraheem
yesterday
@Ibraheem, Yes thisperl
solution is very good, probably with a great margin fast enough for your purpose. -- But try mytr
andcut
solution, which is actually faster in my computer (and I think easier to understand and modify), and wait for a solution withparallel
andperl
by PerlDuck, which I think can be the fastest of them all.
– sudodus
yesterday
@sudodus I tried your solution, it was really fast (took about 0.205 seconds), but the columns are not coming as I want them and it has a pipe in the middle,
– Ibraheem
yesterday
@Ibraheem, Is it important to have the format that you want (order of columns and separators between the column)? The reason why my solution is fast is that it does as little as possible, still showing what you need (but in a different order). If you prefer another separator, it is possible, space' '
would cost no extra time, another separator would cost some extra time for atr
ortr -s
command, but not very much.
– sudodus
yesterday
1
I finally made a oneliner withawk
, which is on par with theperl
oneliner (slightly faster in my computer), maybe easier to understand and edit, if you would need that in the future. The outputs of these two oneliners are exactly the same for the test case. See the end of my answer. Any of the two solutions should be good for you.
– sudodus
yesterday
A pure sed solution:
sed -r 's/^[^|]+\|[^|]+\|([^|]+)\|[^|]+\|([^|]+)\|.+\( .+, ([^ ]+).+/\2:\3,\1/' <in.dat >out.dat
+1: Nice with a pure sed solution :-) But my cut and sed solution is faster ;-)
– sudodus
yesterday
Yes I know. But mine produces the result in the requested order 🤨🤨
– xenoid
yesterday
That's right, we will see how important it is to get exactly what the OP prescribes. By the way, I think you drop one character, MSRFKH00OL6 --> MSRFKH00OL, in your output. I think you can fix that with a minor edit.
– sudodus
yesterday
@sudodus Yes, transcription error. Fixed :)
– xenoid
yesterday
I timed your new one-liner and it works well, actually slightly faster than before. I don't know if there was something else happening in my computer, anyway, I edited my answer to show the new result :-)
– sudodus
yesterday
edited yesterday, answered yesterday – xenoid
Oneliner
If the order of the items and the separators can be different from what you specify in the question, I thought the following one-liner would do it,
< input tr ' ' '|' | cut -d '|' -f 4,6,10 > output
but in a comment you wrote that you need exactly the specified format.
I added a solution with awk, which is approximately on par with PerlDuck's solution with perl. See the end of this answer.
< input awk '{gsub("\\|"," "); print $5 ":" $9 "," $3}' > output
Test
The test was done in my computer with Lubuntu 18.04.1 LTS, 2*2 processors and 4 GiB RAM.
I made a huge infile by 'doubling 20 times' from your demo input (1572864 lines), so there is some margin to your 500000 lines.
Oneliner with cut and sed:
$ < infile cut -d '|' -f 3,5,6 | sed -e 's/|[A-Z].*, /|/' -e 's/ )$//' > outfile
$ wc -l infile
1572864 infile
$ wc -l outfile
1572864 outfile
$ tail outfile
12-664-1186|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT75.SERV1
12-654-0330|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT81.SERV1
12-654-0330|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT81.C1.P1
12-664-1186|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT75.SERV1
12-654-0330|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT81.SERV1
12-654-0330|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT81.C1.P1
12-664-1186|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT75.SERV1
12-654-0330|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT81.SERV1
12-654-0330|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT81.C1.P1
12-664-1186|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT75.SERV1
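For anyone who wants to repeat the test, here is a minimal Python sketch of how such an infile could be built by repeated doubling. The names doubled and write_infile are made up for this illustration and are not part of the answer. With the three demo lines from the question, 19 doublings give 3 * 2**19 = 1572864 lines, which matches the wc -l count above, so the starting file or the number of doublings may have differed slightly from the '20' mentioned.

```python
# Sketch only: build a large test file by repeatedly doubling a small sample.
# DEMO_LINES are the three input lines shown in the question.
DEMO_LINES = [
    "202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ASSN45| Unlocked|12-654-0330|Up|"
    "202-00_MSRFKH00OL6|P282018767.C2028 ( network, R1.S1.LT7.PON8.ONT81.SERV1 )|",
    "202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ASSN46| Unlocked|12-654-0330|Down|"
    "202-00_MSRFKH00OL6|P282017856.C881 ( local, R1.S1.LT7.PON8.ONT81.C1.P1 )|",
    "202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ASSN52| Unlocked|12-664-1186|Up|"
    "202-00_MSRFKH00OL6|P282012623.C2028 ( network, R1.S1.LT7.PON8.ONT75.SERV1 )|",
]


def doubled(lines, times):
    """Return a new list in which `lines` has been doubled `times` times."""
    result = list(lines)
    for _ in range(times):
        result = result + result
    return result


def write_infile(path, lines, times):
    """Write the doubled sample to `path`, one record per line."""
    with open(path, "w") as handle:
        for line in doubled(lines, times):
            handle.write(line + "\n")
```

Calling write_infile("infile", DEMO_LINES, 19) would write a file of 1572864 lines (a bit over 200 MB), so pick a smaller count for a quick check.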
Timing
We might expect that a pure sed solution would be faster, but I think that reordering the data slows it down, so that the cut and sed solution is faster. Both solutions work without any problem in my computer.
Oneliner with cut and sed:
$ time < infile cut -d '|' -f 3,5,6 | sed -e 's/|[A-Z].*, /|/' -e 's/ )$//' > outfile
real 0m8,132s
user 0m8,633s
sys 0m0,617s
A pure sed oneliner by xenoid:
$ time sed -r 's/^[^|]+\|[^|]+\|([^|]+)\|[^|]+\|([^|]+)\|.+\( .+, ([^ ]+).+/\2:\3,\1/' <infile > outfile-sed
real 1m8,686s
user 1m8,259s
sys 0m0,344s
A perl oneliner by PerlDuck is faster than the previous oneliners:
$ time perl -lne 'print "$2:$3,$1" if /^(?:[^|]+\|){2}([^|]+)\|[^|]+\|([^|]+)\|[^,]+,\s*(\S+)/;' < infile > outfile.perl
real 0m5,929s
user 0m5,339s
sys 0m0,256s
Oneliner with tr and cut with a tr -s command:
I used tr to convert the spaces in the input file to pipe characters, and then cut could do it all without sed. As you can see, tr is much faster than sed. The tr -s command squeezes the repeated pipes that result (for example where a pipe was followed by a space), which is a good idea, particularly if there can be repeated spaces or pipes in the input file. Squeezing shifts the field numbers for cut, which is why this variant uses fields 3,5,9 and the variant without tr -s uses 4,6,10. It does not cost much.
$ time < infile tr ' ' '|' | tr -s '|' '|' | cut -d '|' -f 3,5,9 > outfile-tr-cut
real 0m1,277s
user 0m1,781s
sys 0m0,925s
Oneliner with tr and cut without the tr -s command, fastest so far:
$ time < infile tr ' ' '|' | cut -d '|' -f 4,6,10 > outfile-tr-cut
real 0m1,199s
user 0m1,020s
sys 0m0,618s
$ tail outfile-tr-cut
12-664-1186|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT75.SERV1
12-654-0330|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT81.SERV1
12-654-0330|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT81.C1.P1
12-664-1186|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT75.SERV1
12-654-0330|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT81.SERV1
12-654-0330|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT81.C1.P1
12-664-1186|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT75.SERV1
12-654-0330|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT81.SERV1
12-654-0330|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT81.C1.P1
12-664-1186|202-00_MSRFKH00OL6|R1.S1.LT7.PON8.ONT75.SERV1
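To make the field numbers in the two tr and cut variants easier to follow, here is a small Python sketch that mimics the pipelines for a single line. It is only an illustration of the logic, not a replacement for the shell commands. The helper names tr_cut and to_requested are made up here, and to_requested shows one possible way to rearrange the fast pipe-separated output into the format asked for in the question.

```python
import re

# First demo line from the question, used as sample input.
SAMPLE = ("202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ASSN45| Unlocked|12-654-0330|Up|"
          "202-00_MSRFKH00OL6|P282018767.C2028 ( network, R1.S1.LT7.PON8.ONT81.SERV1 )|")


def tr_cut(line, fields, squeeze=False):
    """Mimic `tr ' ' '|' | cut -d '|' -f ...`, optionally with `tr -s '|'`."""
    converted = line.replace(" ", "|")             # tr ' ' '|'
    if squeeze:
        converted = re.sub(r"\|+", "|", converted)  # tr -s '|'
    parts = converted.split("|")
    # cut numbers fields from 1, Python lists from 0
    return "|".join(parts[number - 1] for number in fields)


def to_requested(record):
    """Turn 'id|olt|ont' into the 'olt:ont,id' format from the question."""
    ident, olt, ont = record.split("|")
    return "{}:{},{}".format(olt, ont, ident)


if __name__ == "__main__":
    print(tr_cut(SAMPLE, (4, 6, 10)))               # without tr -s
    print(tr_cut(SAMPLE, (3, 5, 9), squeeze=True))  # with tr -s
    print(to_requested(tr_cut(SAMPLE, (4, 6, 10))))
```

Without squeezing, the '| ' after the first field becomes '||', which creates an empty field 2; that is why the numbers are one higher in the variant without tr -s.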
Oneliner with awk, fast but not the fastest:
< input awk '{gsub("\\|"," "); print $5 ":" $9 "," $3}' > output
$ time < infile awk '{gsub("\\|"," "); print $5 ":" $9 "," $3}' > outfile.awk
real 0m5,091s
user 0m4,724s
sys 0m0,365s
Speed summary: the 'real' time according to time, rounded to 1 decimal:
1m 8.7s - sed
8.1s - cut & sed
5.9s - perl
5.1s - awk
1.2s - tr & cut
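The summary can also be read as throughput. The following Python sketch only does arithmetic on the rounded 'real' times listed above and the 1572864-line test file; the dictionary and function names are made up for this illustration, and the numbers will of course differ on other machines.

```python
# Rounded 'real' times in seconds from the summary above.
LINES = 1572864
REAL_SECONDS = {
    "sed": 68.7,
    "cut & sed": 8.1,
    "perl": 5.9,
    "awk": 5.1,
    "tr & cut": 1.2,
}


def lines_per_second(seconds, lines=LINES):
    """Throughput for a run that took `seconds` to process `lines` lines."""
    return lines / seconds


def speedup_over(baseline, tool, timings=REAL_SECONDS):
    """How many times faster `tool` was than `baseline`."""
    return timings[baseline] / timings[tool]


for name, seconds in REAL_SECONDS.items():
    print("{:10s} {:12.0f} lines/s  {:5.1f}x vs sed".format(
        name, lines_per_second(seconds), speedup_over("sed", name)))
```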
Finally, I note that the oneliners with sed, perl and awk create an output file with the prescribed format.
$ tail outfile.awk
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ONT75.SERV1,12-664-1186
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ONT81.SERV1,12-654-0330
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ONT81.C1.P1,12-654-0330
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ONT75.SERV1,12-664-1186
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ONT81.SERV1,12-654-0330
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ONT81.C1.P1,12-654-0330
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ONT75.SERV1,12-664-1186
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ONT81.SERV1,12-654-0330
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ONT81.C1.P1,12-654-0330
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ONT75.SERV1,12-664-1186
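The awk oneliner works by turning every pipe into a space and then letting awk split the line on whitespace, so that the wanted items end up in fields 5, 9 and 3. The same idea can be sketched in Python as below. The function name awk_like is made up for this illustration, and the sketch assumes every line has the same layout as the demo input.

```python
# First two demo lines from the question, used as sample input.
SAMPLE_LINES = [
    "202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ASSN45| Unlocked|12-654-0330|Up|"
    "202-00_MSRFKH00OL6|P282018767.C2028 ( network, R1.S1.LT7.PON8.ONT81.SERV1 )|",
    "202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ASSN46| Unlocked|12-654-0330|Down|"
    "202-00_MSRFKH00OL6|P282017856.C881 ( local, R1.S1.LT7.PON8.ONT81.C1.P1 )|",
]


def awk_like(line):
    """Replace pipes with spaces, split on whitespace, print $5 ":" $9 "," $3."""
    fields = line.replace("|", " ").split()
    # awk fields are numbered from 1, Python lists from 0
    return "{}:{},{}".format(fields[4], fields[8], fields[2])


if __name__ == "__main__":
    for sample in SAMPLE_LINES:
        print(awk_like(sample))
```

Splitting on whitespace means a stray space inside a field would shift the field numbers, so this approach depends on the fields themselves containing no spaces, as in the demo input.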
Nice :-) Try perl -lne 'print "$2:$3,$1" if /^(?:[^|]+\|){2}([^|]+)\|[^|]+\|([^|]+)\|[^,]+,\s*(\S+)/;' < input.txt > output.txt. I also repeated the input until I got 1,572,864 lines and it runs in ~3,5 seconds on my machine.
– PerlDuck
yesterday
Note that the desired output is not | separated but uses : and , instead.
– PerlDuck
yesterday
@PerlDuck, 1. Yes, I will time your perl expression :-) 2. I know (and wrote about it in the beginning of my answer), that it is not exactly what the OP wants, but similar enough to be useful, and, I think, faster than if rearranged to the exact specification.
– sudodus
yesterday
Sorry, I missed your introductory sentence about the different separators. Btw., this is an interesting approach using GNU parallel ;-)
– PerlDuck
yesterday
@PerlDuck, If you make an answer with your fast perl oneliner, I will upvote it :-)
– sudodus
yesterday
edited yesterday, answered yesterday – sudodus
Python
import sys,re
# The literal "|" and ")" characters must be escaped in the pattern.
pattern=re.compile(r'^.+\|.+\|(.+)\|.+\|(.+)\|.+, (.+) \)\|$')
for line in sys.stdin:
    match=pattern.match(line)
    if match:
        # column 5, ':', tail of the last column, ',', column 3
        print(match.group(2)+':'+match.group(3)+','+match.group(1))
(works with both Python 2 and Python 3)
Using a regex with negated character classes and non-greedy matches is about 4x faster (avoids backtracking?) and puts Python on par with the cut/sed method (Python 2 being a bit faster than Python 3):
import sys,re
pattern=re.compile(r'^[^|]+?\|[^|]+?\|([^|]+?)\|[^|]+?\|([^|]+?)\|[^,]+?, (.+) \)\|$')
for line in sys.stdin:
    match=pattern.match(line)
    if match:
        print(match.group(2)+':'+match.group(3)+','+match.group(1))
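A rough way to check the two patterns above against each other is to confirm that they give identical output on the sample lines from the question, and then time them with timeit. This is only a minimal sketch, not the benchmark used for the 4x figure: the names GREEDY, NEGATED, SAMPLE, convert and time_pattern are made up for illustration, the backslash escapes in the patterns are assumed, and the timings will vary with the machine and Python version.

```python
import re
import timeit

# Both patterns escape the literal "|" and ")" characters (assumed escapes).
GREEDY = re.compile(r'^.+\|.+\|(.+)\|.+\|(.+)\|.+, (.+) \)\|$')
NEGATED = re.compile(
    r'^[^|]+?\|[^|]+?\|([^|]+?)\|[^|]+?\|([^|]+?)\|[^,]+?, (.+) \)\|$')

# Two of the sample input lines from the question.
SAMPLE = [
    "202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ASSN45| Unlocked|12-654-0330|Up|"
    "202-00_MSRFKH00OL6|P282018767.C2028 ( network, R1.S1.LT7.PON8.ONT81.SERV1 )|",
    "202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ASSN46| Unlocked|12-654-0330|Down|"
    "202-00_MSRFKH00OL6|P282017856.C881 ( local, R1.S1.LT7.PON8.ONT81.C1.P1 )|",
]


def convert(pattern, line):
    """Return 'column5:suffix,column3' for a matching line, else None."""
    match = pattern.match(line)
    if match is None:
        return None
    return match.group(2) + ':' + match.group(3) + ',' + match.group(1)


def time_pattern(pattern, lines, repeat=1000):
    """Return the seconds needed to convert every line `repeat` times."""
    return timeit.timeit(
        lambda: [convert(pattern, line) for line in lines], number=repeat)
```

Calling time_pattern(GREEDY, SAMPLE) and time_pattern(NEGATED, SAMPLE) on a larger list of real lines gives a comparison for a given machine.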
edited yesterday
answered yesterday
xenoid
This one also works as expected, but it is a bit slower than the perl one.
– Ibraheem
14 hours ago
1
That's an awful lot of processes to fire off at, more or less, the same time. You might want to wait after some number of lines, or investigate other strategies to parallelize the job (such as GNU parallel).
– glenn jackman
2 days ago
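The idea of capping the number of concurrent workers, instead of starting one background job per input line, can be sketched in Python (not bash, and not GNU parallel). This is an illustration under assumptions: RECORD, convert_line and convert_all are hypothetical names, the pattern is assumed to fit the record layout shown in the question, and the defaults for workers and chunksize are arbitrary.

```python
import re
from multiprocessing import Pool

# Hypothetical pattern for the '|'-separated records shown in the question.
RECORD = re.compile(
    r'^(?:[^|]+\|){2}([^|]+)\|[^|]+\|([^|]+)\|[^,]+,\s*(\S+)')


def convert_line(line):
    """Return 'column5:suffix,column3', or None when the line does not match."""
    match = RECORD.match(line)
    if match is None:
        return None
    return "{}:{},{}".format(match.group(2), match.group(3), match.group(1))


def convert_all(lines, workers=4, chunksize=10000):
    """Convert lines with at most `workers` processes alive at any time."""
    with Pool(processes=workers) as pool:
        # imap keeps the input order; chunksize hands lines to the workers
        # in batches, so process overhead is paid per batch, not per line.
        for result in pool.imap(convert_line, lines, chunksize=chunksize):
            if result is not None:
                yield result
```

In a script, convert_all would be called on the open input file from under an `if __name__ == "__main__":` guard. The pool never holds more than `workers` processes, however many lines the file contains.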
@PerlDuck I have added the input and output of the script. Of course, it won't run as is, since some of the variables are defined outside this code. I am also thinking of trying sed or awk for this job; it might be a lot quicker, but I need to learn how to write such expressions.
– Ibraheem
yesterday
@glennjackman I have been reading about parallel. Can you suggest how I could use it in a loop like the one above?
– Ibraheem
yesterday
Your code seems amenable to a single sed instruction operating on the input file that would run thousands of times faster. awk would also be a solution.
– xenoid
yesterday
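The single-pass idea (one process reading the whole file, with no subprocess per line) can also be sketched with plain field splitting, much as awk would split on |. This is not the sed or awk expression referred to above, only a Python illustration; convert_line and convert_stream are hypothetical names, and the layout of the last column is assumed from the sample input.

```python
def convert_line(line):
    """Build 'column5:suffix,column3' from one '|'-separated record."""
    fields = line.rstrip("\n").split("|")
    if len(fields) < 6:
        return None  # not a full record
    # The sixth column looks like 'P... ( network, R1.S1... )':
    # keep what follows the comma, without the closing parenthesis.
    _, comma, tail = fields[5].partition(",")
    if not comma:
        return None
    suffix = tail.replace(")", "").strip()
    return "{}:{},{}".format(fields[4], suffix, fields[2])


def convert_stream(lines):
    """Yield converted records from an iterable of lines in one pass."""
    for line in lines:
        result = convert_line(line)
        if result is not None:
            yield result
```

Feeding an open file object to convert_stream and writing each result to the output file processes the input sequentially, so the number of running processes stays at one.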
@xenoid can you please suggest some sed expression?
– Ibraheem
yesterday