plyr in R very slow during merging

I am using the plyr package in R to do the following:

  • pick up a row from table A according to column A and column B
  • find the row in table B that has the same values in column A and column B
  • copy column C from table B to table A

I have enabled the progress bar to show the progress, but after it reaches 100% the call still seems to be running: my CPU is still occupied by RGUI, and the call never returns.

My table A has about 40,000 rows of data, with unique combinations of column A and column B.

I suspect that the “combine” part of the “split-apply-combine” workflow in plyr cannot handle these 40,000 rows, because the same code works on another table with 4,000 rows.

Any suggestions for improving the efficiency? Thanks.

UPDATE

Here is my code:

for (loop.filename in (1:nrow(filename)))
  {print("infection source merge")
   print(filename[loop.filename, "table_name"])
   temp <- get(filename[loop.filename, "table_name"])
   temp1 <- ddply(temp,
                  c("HOSP_NO", "REF_DATE"),
                  function(df)
                    {temp.infection.source <- abcde[abcde[,"Case_Number"]==unique(df[,"HOSP_NO"]) &
                                              abcde[,"Reference_Date"]==unique(df[,"REF_DATE"]),
                                              "Case_Definition"]
                     if (length(temp.infection.source)==0) {
                         temp.infection.source<-"NIL"
                         } else {
                         if (length(unique(temp.infection.source))>1) {
                             temp.infection.source<-"MULTIPLE"
                             } else {
                            temp.infection.source<-unique(temp.infection.source)}}
                     data.frame(df,
                                INFECTION_SOURCE=temp.infection.source)
                     },
                    .progress="text")
   assign(filename[loop.filename, "table_name"], temp1)
  }


If I understood correctly what you’re trying to achieve, this should do what you want, quickly and without using too much memory.

#toy data
A <- data.frame(
    A=letters[1:10],
    B=letters[11:20],
    CC=1:10
)

ord <- sample(1:10)
B <- data.frame(
    A=letters[1:10][ord],
    B=letters[11:20][ord],
    CC=(1:10)[ord]
)
#combining values
A.comb <- paste(A$A,A$B,sep="-")
B.comb <- paste(B$A,B$B,sep="-")
#matching
A$DD <- B$CC[match(A.comb,B.comb)]
A

This applies only if the combinations are unique, because match() returns only the first matching position. If they’re not unique, you’ll have to take care of that first. Without the data it’s hard to know exactly what your complete function is meant to do, but you should be able to port the logic given here to your own case.
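
As a rough sketch of how that port might look when the combinations are not unique, the lookup table can first be collapsed to one value per key and then matched in a single vectorised step. The column names and the "NIL" / "MULTIPLE" rule are taken from the code in the question; the toy data, the "|" separator, and the helper names temp.key, abcde.key and source.by.key are assumptions made up for illustration.

```r
# Toy stand-ins for the question's tables (hypothetical data)
temp <- data.frame(
    HOSP_NO  = c("H1", "H2", "H3", "H4"),
    REF_DATE = c("2010-01-01", "2010-01-02", "2010-01-03", "2010-01-04"),
    stringsAsFactors = FALSE
)
abcde <- data.frame(
    Case_Number     = c("H1", "H2", "H2", "H3", "H3"),
    Reference_Date  = c("2010-01-01", "2010-01-02", "2010-01-02",
                        "2010-01-03", "2010-01-03"),
    Case_Definition = c("Imported", "Local", "Imported", "Local", "Local"),
    stringsAsFactors = FALSE
)

# One combined key per row in each table ("|" is assumed not to occur in the data)
temp.key  <- paste(temp$HOSP_NO, temp$REF_DATE, sep = "|")
abcde.key <- paste(abcde$Case_Number, abcde$Reference_Date, sep = "|")

# Collapse the lookup table to one value per key:
# a single distinct definition is kept, several distinct ones become "MULTIPLE"
source.by.key <- tapply(abcde$Case_Definition, abcde.key, function(x) {
    u <- unique(x)
    if (length(u) > 1) "MULTIPLE" else u
})

# Vectorised lookup; keys that are absent from abcde come back as NA
hit <- match(temp.key, names(source.by.key))
temp$INFECTION_SOURCE <- as.vector(source.by.key)[hit]
temp$INFECTION_SOURCE[is.na(hit)] <- "NIL"
temp
```

The anonymous function runs once per distinct key in the lookup table rather than once per row of the large table, and the final assignment is a single match() call, so there is no per-group data.frame() construction left for plyr to combine.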
