Get first value that matches condition (loop too slow)
I've got a lot of matrices similar to this one, but with thousands of rows:
r <- 10
c <- 2
set.seed(333)
m1 <- matrix(runif(r*c)+1, r, c)
> m1
          [,1]     [,2]
 [1,] 1.467001 1.393902
 [2,] 1.084598 1.474218
 [3,] 1.973485 1.891222
 [4,] 1.571306 1.665011
 [5,] 1.020119 1.736832
 [6,] 1.723557 1.911469
 [7,] 1.609394 1.637850
 [8,] 1.306719 1.864651
 [9,] 1.063510 1.287575
[10,] 1.305353 1.129959
I've got a loop that tells me, for each value in the first column, the index of the first value in the second column that is more than 10% higher, like so:
result <- 1:nrow(m1)
for (i in 1:nrow(m1)){
  result[i] <- which(m1[,2] > (1.1*m1[,1][i]))[1]
}
> result
[1] 3 1 NA 3 1 6 3 2 1 2
I've got so many matrices that it's taking hours, and after profiling my code, this loop is by far the biggest time consumer. What, in your view, is the fastest way to do it?
For example, with r = 30000:
start_time <- Sys.time()
for (i in 1:nrow(m1)){
  result[i] <- which(m1[,2] > (1.1*m1[,1][i]))[1]
}
end_time <- Sys.time()
a <- end_time - start_time
> a
Time difference of 11.25815 secs
Thanks for your help! – Boog
Is this for repeated application? If so, maybe study the relevant algorithms and create a function with Rcpp or similar. – sindri_baldur

Also, as general advice, don't forget to consider the big picture. With profiling, it's easy to find that your code spends 99% of its time doing X and think "I need to find a faster way to do X," when often the question you should really be asking is "How can I avoid doing X so much?" (Of course, without seeing the rest of your code or knowing what its purpose is, we can't really help you with that here. But it's worth keeping in mind.) – Ilmari Karonen
5 Answers
There are some shortcuts you can take here. You are looking for the first value in column 2 that is higher than some other value. This means that it is never worth looking at values that are lower than what we have previously seen in column 2.
In your example with 10 rows, that would be as follows:
> cummax(m1[, 2])
[1] 1.393902 1.474218 1.891222 1.891222 1.891222 1.911469 1.911469 1.911469 1.911469 1.911469
> which(cummax(m1[, 2]) == m1[, 2])
[1] 1 2 3 6
And as you can see, these indices (1, 2, 3, 6) are the only values that appear in your result vector.
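This observation alone already yields a fully vectorised solution. Here is a sketch of that route using findInterval() (an illustrative addition, not part of the original answer; it assumes no exact ties in column 2, which holds with probability one for runif data):

cm <- cummax(m1[, 2])
cand <- which(cm == m1[, 2])   # rows where a new running maximum appears
vals <- m1[cand, 2]            # the running maxima, in increasing order
targets <- m1[, 1] * 1.1
# findInterval() counts how many running maxima are <= each target, so the
# next candidate (if any) is the first row strictly above that target
pos <- findInterval(targets, vals) + 1
result <- ifelse(pos > length(cand), NA_integer_, cand[pos])
result
[1]  3  1 NA  3  1  6  3  2  1  2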
A second optimisation is to sort by the first column. If you start with the lowest target and work your way up, you don't have to rescan the second column for each row: you only step to the next row of column 2 once no remaining targets match the current one.
This does bear the cost of sorting the matrix, but afterwards the result can be found in a single pass through both columns.
dostuff <- function(m1){
  orderColumn1 <- order(m1[, 1])   # visit column-1 targets from smallest to largest
  plus.10 <- m1[, 1] * 1.1
  results <- rep(NA, length(plus.10))
  IndexColumn1 <- 1
  IndexColumn2 <- 1
  row2CurrentMax <- 0
  while(IndexColumn2 <= nrow(m1)){
    row2Current <- m1[IndexColumn2, 2]
    if(row2Current > row2CurrentMax){   # only a new running maximum can be a first match
      row2CurrentMax <- row2Current
      # assign this row to every remaining target it satisfies; the bounds
      # check guards the case where every target has already been matched
      while(IndexColumn1 <= nrow(m1)){
        row1Current <- plus.10[orderColumn1[IndexColumn1]]
        if(row1Current <= row2CurrentMax){
          results[orderColumn1[IndexColumn1]] <- IndexColumn2
          IndexColumn1 <- IndexColumn1 + 1
        } else {
          break
        }
      }
    }
    IndexColumn2 <- IndexColumn2 + 1
  }
  results
}
With 30000 rows:
> start_time <- Sys.time()
> result <- dostuff(m1)
> end_time <- Sys.time()
> a <- end_time - start_time
> a
Time difference of 0.0600059 secs
This could probably be improved on using Rcpp, but I am not familiar enough with that. – JAD
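Following up on that comment, here is a minimal Rcpp sketch of the same single-pass, two-pointer idea (an illustrative addition, not code from the thread; it assumes the Rcpp package is installed and receives the 1-based ordering of column 1 computed in R):

library(Rcpp)

cppFunction('
IntegerVector first_higher(NumericVector plus10, NumericVector col2,
                           IntegerVector ord) {
  // plus10: column 1 times 1.1; ord: 1-based order(plus10) computed in R
  int n = plus10.size();
  IntegerVector results(n, NA_INTEGER);
  double runningMax = 0.0;
  int i1 = 0;                       // pointer into the sorted targets
  for (int i2 = 0; i2 < n; ++i2) {  // single pass over column 2
    if (col2[i2] > runningMax) {
      runningMax = col2[i2];
      // hand this row to every remaining target it satisfies
      while (i1 < n && plus10[ord[i1] - 1] <= runningMax) {
        results[ord[i1] - 1] = i2 + 1;   // back to 1-based row indices
        ++i1;
      }
    }
  }
  return results;
}')

plus10 <- m1[, 1] * 1.1
result <- first_higher(plus10, m1[, 2], order(plus10))  # should match dostuff(m1)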
I don't imagine this is the fastest way, but it will be somewhat faster than the current for-loop approach. Since FALSE sorts before TRUE, which.min(m2 < x) returns the index of the first FALSE, i.e. the first position where m2 >= x; with continuous data this coincides with the strict inequality, and targets with no match at all are patched to NA afterwards.
plus.10 <- m1[, 1] * 1.1
m2 <- m1[, 2]
result <- sapply(plus.10, function(x) which.min(m2 < x))
result[plus.10 > max(m2)] <- NA   # targets higher than everything in column 2
result
[1]  3  1 NA  3  1  6  3  2  1  2
Edit: As requested by Ronak, microbenchmark results of the solutions proposed so far on 10000 rows:
Unit: milliseconds
expr min lq mean median uq max neval cld
h1 335.342689 337.35915 361.320461 341.804840 347.856556 516.230972 25 b
sindri 672.587291 688.78673 758.445467 713.240778 811.298608 1049.109844 25 d
op 865.567412 884.99514 993.066179 1006.694036 1026.434344 1424.755409 25 e
loco 675.809092 682.98591 731.256313 693.672064 807.007358 821.893865 25 d
dmitry 420.869493 427.56492 454.439806 433.656519 438.367480 607.030825 25 c
jad 4.369628 4.41044 4.735393 4.503657 4.556527 7.488471 25 a
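For reference, a minimal sketch of how such a comparison can be reproduced with microbenchmark (an illustrative addition; h1 wraps the code above, op wraps the original loop, and the remaining entries would be wrapped the same way):

library(microbenchmark)

h1 <- function(m1) {
  plus.10 <- m1[, 1] * 1.1
  m2 <- m1[, 2]
  result <- sapply(plus.10, function(x) which.min(m2 < x))
  result[plus.10 > max(m2)] <- NA
  result
}

op <- function(m1) {   # the original for loop, for comparison
  result <- rep(NA_integer_, nrow(m1))
  for (i in 1:nrow(m1)) result[i] <- which(m1[, 2] > (1.1*m1[, 1][i]))[1]
  result
}

set.seed(333)
m1 <- matrix(runif(10000 * 2) + 1, 10000, 2)
microbenchmark(h1 = h1(m1), op = op(m1), times = 25)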
And if you first store m2 = m1[,2] you can be faster. – LocoGris
@RonakShah - Yes, I didn't mean generally, I meant specifically the solution proposed by OP. – H 1
Here is an attempt using match(), which reduces the time of the r = 30000 example in the original post by about 25%. match(TRUE, m1[, 2] > x) returns just the position of the first TRUE, instead of materialising the full vector of matching indices the way which() does.
sapply(m1[, 1] * 1.1, function(x) match(TRUE, m1[, 2] > x))
[1]  3  1 NA  3  1  6  3  2  1  2
– sindri_baldur
I propose these two variants:
r <- 30000
c <- 2
set.seed(333)
m1 <- matrix(runif(r*c)+1, r, c)
x2 <- m1[, 2]

start_time <- Sys.time()
result <- lapply(m1[, 1], function(x) {
  min(which(m1[,2] > (1.1*x)))
})
end_time <- Sys.time()
a <- end_time - start_time
a

start_time <- Sys.time()
result <- lapply(m1[, 1], function(x) {
  min(which(x2 > (1.1*x)))   # column 2 extracted once, outside the loop
})
end_time <- Sys.time()
a <- end_time - start_time
a
First one: 8.6 s
Second one: 6.4 s
– LocoGris
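One caveat worth noting (an added remark, not from the answer): min(which(...)) returns Inf, with a warning, whenever no value in column 2 is high enough, whereas the original loop produced NA in that case. A small variant that restores the NA behaviour:

result <- vapply(m1[, 1], function(x) {
  idx <- which(x2 > (1.1*x))
  if (length(idx)) idx[1] else NA_integer_   # NA instead of Inf on no match
}, integer(1))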
The best way to optimize your code is to use the data.table package. This code gives you a > 2x speed-up.
library(data.table);
setDTthreads(0);   # 0 = use all available threads

r <- 30000;
c <- 2;
set.seed(333);

m1 <- matrix(runif(r*c)+1, r, c);
result1 <- rep(NA, nrow(m1));

start_time <- Sys.time();
for (i in 1:nrow(m1))
{
  result1[i] <- which(m1[,2]>(1.1*m1[,1][i]))[1];
}
end_time <- Sys.time()
a <- end_time - start_time
a

start_time <- Sys.time()
tstDT <- data.table(m1);
# result2 <- tstDT[, sapply(V1, function(elem) { which(V2 > 1.1*elem)[1] })]
result2 <- tstDT[, sapply(V1, function(x) match(TRUE, V2 > 1.1*x) )]
end_time <- Sys.time()
a <- end_time - start_time
a
A small note: I use data.table compiled by gcc with -march=native and -O3; with -O2 and the default -march (as in the standard installed package) the speed-up will be smaller.
Result:
> library(data.table);
> setDTthreads(0);
> r <- 30000;
> c <- 2;
> set.seed(333);
> m1 <- matrix(runif(r*c)+1, r, c);
> result1 <- rep(NA, nrow(m1));
> start_time <- Sys.time();
> for (i in 1:nrow(m1))
+ {
+   result1[i] <- which(m1[,2]>(1.1*m1[,1][i]))[1];
+ }
> end_time <- Sys.time()
> a <- end_time - start_time
> a
Time difference of 8.738938 secs
> start_time <- Sys.time()
> tstDT <- data.table(m1);
> result2 <- tstDT[, sapply(V1, function(x) match(TRUE, V2 > 1.1*x) )]
> end_time <- Sys.time()
> a <- end_time - start_time
> a
Time difference of 3.582921 secs
– Dmitriy
This is definitely not the best way. As the other answers demonstrate much better ways, with bigger speedups.
– Konrad Rudolph
yesterday
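Konrad's point can be checked directly (an added note, not from the thread): timing the same sapply/match call on a plain extracted column shows how much of the gain comes from match() and from referencing the column only once, rather than from data.table itself.

v1 <- m1[, 1]
v2 <- m1[, 2]   # extracted once, as data.table's j-expression effectively does
system.time(
  result3 <- sapply(v1, function(x) match(TRUE, v2 > 1.1*x))
)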