I have a list of 50,000 IDs in a flat file and need to remove any duplicate IDs. Is there an efficient/recommended algorithm for my problem?
Thanks.
,
You can use the command-line sort program to order and filter the list of IDs. It is a very efficient program and scales well too.
sort -u ids.txt > filteredIds.txt
,
Read the IDs into a dictionary line by line, discarding duplicates as you go. When all lines have been read, write the keys out to a new file.
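A minimal PHP sketch of that approach, assuming one ID per line; the filenames are placeholders:
$seen = array();
$in   = fopen('ids.txt', 'r');
$out  = fopen('filteredIds.txt', 'w');
while (($line = fgets($in)) !== false) {
    $id = trim($line);
    if ($id !== '' && !isset($seen[$id])) {
        $seen[$id] = true;        // mark this ID as seen
        fwrite($out, $id . "\n"); // write each ID only the first time
    }
}
fclose($in);
fclose($out);
This streams the file, so only the set of distinct IDs has to fit in memory.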
,
I did some experiments once, and the fastest solution I could get in PHP was sorting the items and manually removing the duplicates.
If performance isn’t that much of an issue for you (which I suspect; 50,000 is not that much), then you can use array_unique(): http://php.net/array_unique
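A minimal sketch of that sort-then-scan idea (the filenames are placeholders):
$ids = file('ids.txt', FILE_IGNORE_NEW_LINES);
sort($ids);                  // duplicates become adjacent after sorting
$unique = array();
$prev   = null;
foreach ($ids as $id) {
    if ($id !== $prev) {     // keep only the first item of each run
        $unique[] = $id;
        $prev     = $id;
    }
}
file_put_contents('filteredIds.txt', implode("\n", $unique));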
,
I guess if you have a large enough memory allowance, you can put all these IDs in an array, keyed by the ID itself:
$array[$id] = $id;
This would automatically weed out the dupes, since duplicate keys overwrite each other.
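In full, a sketch assuming one ID per line (the filenames are placeholders):
$ids = array();
foreach (file('ids.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $id) {
    $ids[$id] = $id; // assigning to the same key twice keeps only one copy
}
file_put_contents('filteredIds.txt', implode("\n", $ids));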
,
You can do:
file_put_contents($file, implode("\n", array_unique(file($file, FILE_IGNORE_NEW_LINES))));
How it works:
- Read the file using the function file, which returns an array of lines (the FILE_IGNORE_NEW_LINES flag strips the trailing newlines).
- Get rid of the duplicate lines using array_unique.
- Implode those unique lines with "\n" to get a string.
- Write the string back to the file using file_put_contents.
This solution assumes that you’ve got one ID per line in the flat file.
,
You can do it via an array and array_unique. In this example I assume your IDs are separated by line breaks; if that’s not the case, just change the delimiter:
$file = file_get_contents('/path/to/file.txt');
$array = explode("\n",$file);
$array = array_unique($array);
$file = implode("\n",$array);
file_put_contents('/path/to/file.txt',$file);
,
If you can just explode the contents of the file on a comma (or any delimiter), then array_unique will produce the least (and cleanest) code; otherwise, if you are parsing the file line by line, the $array[$id] = $id approach is the fastest and cleanest solution.
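For the comma-delimited case, a minimal sketch (the filenames and the delimiter are assumptions):
$ids = explode(',', trim(file_get_contents('ids.txt')));
$ids = array_unique($ids);
file_put_contents('filteredIds.txt', implode(',', $ids));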
,
If you can use a terminal (or native Unix execution), the easiest way (assuming that there is nothing else in the file) is:
sort < ids.txt | uniq > filteredIds.txt