Get first value that matches condition (loop too slow)












I've got a lot of matrices similar to this one, but with thousands of rows:



r <- 10
c <- 2
set.seed(333)

m1 <- matrix(runif(r*c)+1, r, c)

> m1
          [,1]     [,2]
 [1,] 1.467001 1.393902
 [2,] 1.084598 1.474218
 [3,] 1.973485 1.891222
 [4,] 1.571306 1.665011
 [5,] 1.020119 1.736832
 [6,] 1.723557 1.911469
 [7,] 1.609394 1.637850
 [8,] 1.306719 1.864651
 [9,] 1.063510 1.287575
[10,] 1.305353 1.129959


I've got a loop that tells me, for each value of the first column, the index of the first value in the second column that is at least 10% higher, like so:



result <- 1:nrow(m1)

for (i in 1:nrow(m1)) {
  result[i] <- which(m1[, 2] > (1.1 * m1[, 1][i]))[1]
}
> result
 [1]  3  1 NA  3  1  6  3  2  1  2


I've got so many matrices that it's taking hours, and after profiling my code, the biggest time-consuming task by far is this loop. What, in your view, is the fastest way to do it?



For example, with r = 30000:



start_time <- Sys.time()

for (i in 1:nrow(m1)) {
  result[i] <- which(m1[, 2] > (1.1 * m1[, 1][i]))[1]
}

end_time <- Sys.time()
a <- end_time - start_time

> a
Time difference of 11.25815 secs
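
(As an aside, base R's system.time() captures the same measurement in a single call; a minimal equivalent of the timing above:)

system.time(
  for (i in 1:nrow(m1)) {
    result[i] <- which(m1[, 2] > (1.1 * m1[, 1][i]))[1]
  }
)["elapsed"]  # wall-clock seconds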


Thanks for your help!

– Boog
  • Is this for repeated application? If so, maybe you could study the relevant algorithms and create a function with Rcpp or similar. – sindri_baldur

  • Also, as general advice, don't forget to consider the big picture. With profiling, it's easy to find that your code spends 99% of its time doing X and think "I need to find a faster way to do X," when often the question you should really be asking is "How can I avoid doing X so much?" (Of course, without seeing the rest of your code or knowing what its purpose is, we can't really help you with that here. But it's worth keeping in mind.) – Ilmari Karonen
5 Answers

Answer by JAD (6 votes)

There are some shortcuts you can take here. You are looking for the first value in column 2 that is higher than some other value. This means that it is never worth looking at values that are lower than what we have previously seen in column 2.



In your example with 10 rows, that would be as follows:



> cummax(m1[, 2])
[1] 1.393902 1.474218 1.891222 1.891222 1.891222 1.911469 1.911469 1.911469 1.911469 1.911469
> which(cummax(m1[, 2]) == m1[, 2])
[1] 1 2 3 6


And as you can see, these are the only values in your result vector.
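
For what it's worth, this observation can be taken straight to a vectorized solution. A minimal sketch (my own, not part of the original answer; assumes m1 as defined in the question), using the fact that the record values are increasing and can therefore be binary-searched with findInterval():

cand <- which(cummax(m1[, 2]) == m1[, 2])             # rows where column 2 sets a new record
idx  <- findInterval(1.1 * m1[, 1], m1[cand, 2]) + 1  # position of first record strictly above
result <- cand[idx]                                   # indexing past the end gives NA, as wanted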



A second optimisation is to order the first column. If you start with its lowest value and work your way up, you don't have to rescan the second column for every row: you only move on to the next row of the second column once there are no more matches for the current one.



This does bear the cost of sorting the matrix, but afterwards the result can be found using a single pass through both columns.



dostuff <- function(m1) {
  # Process column-1 values from smallest to largest.
  orderColumn1 <- order(m1[, 1])

  plus.10 <- m1[, 1] * 1.1

  results <- rep(NA, length(plus.10))

  IndexColumn1 <- 1
  IndexColumn2 <- 1
  row2CurrentMax <- 0
  while (IndexColumn2 <= nrow(m1)) {
    row2Current <- m1[IndexColumn2, 2]
    if (row2Current > row2CurrentMax) {
      row2CurrentMax <- row2Current
      # Assign this row as the answer for every still-unmatched column-1
      # value it exceeds; the bounds check keeps the loop from running
      # past the end once every value has been matched.
      while (IndexColumn1 <= nrow(m1)) {
        row1Current <- plus.10[orderColumn1[IndexColumn1]]
        if (row1Current <= row2CurrentMax) {
          results[orderColumn1[IndexColumn1]] <- IndexColumn2
          IndexColumn1 <- IndexColumn1 + 1
        } else {
          break
        }
      }
    }
    IndexColumn2 <- IndexColumn2 + 1
  }
  results
}


With 30000 rows:



> start_time <- Sys.time()
> result <- dostuff(m1)
> end_time <- Sys.time()
> a <- end_time - start_time
> a
Time difference of 0.0600059 secs





  • This could probably be improved on using Rcpp, but I am not familiar enough with that. – JAD
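
Following up on that comment, a minimal Rcpp sketch of the task (an illustration under assumptions, not code from the thread; the function name first_above is made up):

library(Rcpp)

cppFunction('
IntegerVector first_above(NumericVector x1, NumericVector x2) {
  int n = x2.size();
  IntegerVector out(x1.size(), NA_INTEGER);           // NA where no match exists
  for (int i = 0; i < x1.size(); ++i) {
    double target = 1.1 * x1[i];
    for (int j = 0; j < n; ++j) {
      if (x2[j] > target) { out[i] = j + 1; break; }  // 1-based index, like R
    }
  }
  return out;
}')

# result <- first_above(m1[, 1], m1[, 2])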



















Answer by H 1 (7 votes)

I don't imagine this is the fastest way, but it will be somewhat faster than the current for-loop approach.



plus.10 <- m1[, 1] * 1.1
m2 <- m1[, 2]
result <- sapply(plus.10, function(x) which.min(m2 < x))
result[plus.10 > max(m2)] <- NA

result
 [1]  3  1 NA  3  1  6  3  2  1  2


Edit: As requested by Ronak, microbenchmark results of the solutions proposed so far on 10000 rows:



Unit: milliseconds
   expr        min        lq       mean      median          uq         max neval cld
     h1 335.342689 337.35915 361.320461  341.804840  347.856556  516.230972    25   b
 sindri 672.587291 688.78673 758.445467  713.240778  811.298608 1049.109844    25   d
     op 865.567412 884.99514 993.066179 1006.694036 1026.434344 1424.755409    25   e
   loco 675.809092 682.98591 731.256313  693.672064  807.007358  821.893865    25   d
 dmitry 420.869493 427.56492 454.439806  433.656519  438.367480  607.030825    25   c
    jad   4.369628   4.41044   4.735393    4.503657    4.556527    7.488471    25   a
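
For reference, a sketch of how such a table can be produced; the per-entry expressions here are my reconstruction from the answers in this thread, not the exact benchmark code:

library(microbenchmark)

plus.10 <- m1[, 1] * 1.1
m2 <- m1[, 2]

microbenchmark(
  h1     = { res <- sapply(plus.10, function(x) which.min(m2 < x))
             res[plus.10 > max(m2)] <- NA
             res },
  sindri = sapply(plus.10, function(x) match(TRUE, m2 > x)),
  op     = sapply(seq_len(nrow(m1)), function(i) which(m2 > plus.10[i])[1]),
  jad    = dostuff(m1),
  times  = 25
)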





  • And if you first store m2 = m1[, 2] you can be faster. – LocoGris

  • @RonakShah - Yes, I didn't mean generally, I meant specifically the solution proposed by the OP. – H 1



















Answer by sindri_baldur (2 votes)

Here is an attempt using match(), which stops at the first TRUE instead of collecting every matching index the way which() does. It reduces the time compared to the r = 30000 example in the original post by about 25%.



sapply(m1[, 1] * 1.1, function(x) match(TRUE, m1[, 2] > x))

 [1]  3  1 NA  3  1  6  3  2  1  2
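
(A quick illustration of the idiom: match(TRUE, x) returns the position of the first TRUE in x, or NA when there is none.)

match(TRUE, c(FALSE, FALSE, TRUE, TRUE))  # 3
match(TRUE, c(FALSE, FALSE))              # NA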





Answer by LocoGris (0 votes)

    I propose these:



r <- 30000
c <- 2
set.seed(333)

m1 <- matrix(runif(r*c)+1, r, c)
x2 <- m1[, 2]


start_time <- Sys.time()

result <- lapply(m1[, 1], function(x) {
  min(which(m1[, 2] > (1.1 * x)))
})
end_time <- Sys.time()
a <- end_time - start_time
a


start_time <- Sys.time()

# Note: min(which(...)) returns Inf (with a warning) when there is no match,
# unlike which(...)[1] in the question, which returns NA.
result <- lapply(m1[, 1], function(x) {
  min(which(x2 > (1.1 * x)))
})
end_time <- Sys.time()
a <- end_time - start_time
a


First one: 8.6 s
Second one: 6.4 s






Answer by Dmitriy (0 votes)

The best way to optimize your code is to use the data.table package.

This code gives you a >2x speed-up.



      library(data.table);

      setDTthreads(0);

      r <- 30000;
      c <- 2;
      set.seed(333);

      m1 <- matrix(runif(r*c)+1, r, c);
      result1 <- rep(NA, nrow(m1));

      start_time <- Sys.time();

for (i in 1:nrow(m1))
{
  result1[i] <- which(m1[,2]>(1.1*m1[,1][i]))[1];
}

      #result1

      end_time <- Sys.time()
      a <- end_time - start_time
      a


      start_time <- Sys.time()

      tstDT <- data.table(m1);
      #result2 <- tstDT[, sapply(V1, function(elem) { which(V2 > 1.1*elem)[1] })]
      result2 <- tstDT[, sapply(V1, function(x) match(TRUE, V2 > 1.1*x) )]

      #result2

      end_time <- Sys.time()
      a <- end_time - start_time
      a


A small comment: I use data.table compiled by gcc with -march=native and -O3. With -O2 and the default -march (as in the standard package installation) the speed-up will be smaller, but...



      Result:



      > library(data.table);
      >
      > setDTthreads(0);
      >
      > r <- 30000;
      > c <- 2;
      > set.seed(333);
      >
      > m1 <- matrix(runif(r*c)+1, r, c);
      > result1 <- rep(NA, nrow(m1));
      >
      > start_time <- Sys.time();
      >
      > for (i in 1:nrow(m1))
      + {
      + result1[i] <- which(m1[,2]>(1.1*m1[,1][i]))[1];
      + }
      >
      > #result1
      >
      > end_time <- Sys.time()
      > a <- end_time - start_time
      > a
      Time difference of 8.738938 secs
      >
      >
      > start_time <- Sys.time()
      >
      > tstDT <- data.table(m1);
      > #result2 <- tstDT[, sapply(V1, function(elem) { which(V2 > 1.1*elem)[1] })]
      > result2 <- tstDT[, sapply(V1, function(x) match(TRUE, V2 > 1.1*x) )]
      >
      > #result2
      >
      > end_time <- Sys.time()
      > a <- end_time - start_time
      > a
      Time difference of 3.582921 secs
      >
      >
      >
      >





  • This is definitely not the best way, as the other answers demonstrate much better approaches with bigger speedups. – Konrad Rudolph










