python – Eliminating duplicates – shall I use a set?

I need to read through a log file, extracting all paths, and return a sorted list of the paths containing no duplicates. What’s the best way to do it? Using a set?

I thought about something like this:

def geturls(filename):
  f = open(filename)
  s = set()  # set() creates an empty set ({} would create an empty dict)

  for line in f:
    # see if the line matches some regex
    match = some_regex.search(line)

    if match:
      s.add(match.group(1))

  f.close()

  return sorted(s)

EDIT

The items put in the set are path strings, which the function should return as a list sorted into alphabetical order.

EDIT 2
Here is some sample data:

10.254.254.28 - - 06/Aug/2007:00:12:20 -0700 "GET
/keyser/22300/ HTTP/1.0" 302 528 "-"
"Mozilla/5.0 (X11; U; Linux i686
(x86_64); en-US; rv:1.8.1.4)
Gecko/20070515 Firefox/2.0.0.4"
10.254.254.58 - - 06/Aug/2007:00:10:05 -0700 "GET
/edu/languages/google-python-class/images/puzzle/a-baaa.jpg HTTP/1.0" 200 2309 "-"
"googlebot-mscrawl-moma (enterprise;
bar-XYZ;
[email protected],[email protected],[email protected],[email protected])"
10.254.254.28 - - 06/Aug/2007:00:11:08 -0700 "GET
/favicon.ico HTTP/1.0" 302 3404 "-"
"googlebot-mscrawl-moma (enterprise;
bar-XYZ;

The interesting parts are the URLs between GET and HTTP. Maybe I should have mentioned that this is part of an exercise, and not real-world data.
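For log lines shaped like the samples above, a short regex can pull out the path between GET and HTTP. This is only a sketch; the exact pattern (here `"GET (\S+) HTTP`) is an assumption based on the sample data and may need adjusting:

```python
import re

# Assumed pattern: capture the non-whitespace path between 'GET ' and ' HTTP'.
LOG_RE = re.compile(r'"GET (\S+) HTTP')

sample = '10.254.254.28 - - 06/Aug/2007:00:12:20 -0700 "GET /keyser/22300/ HTTP/1.0" 302 528'
m = LOG_RE.search(sample)
print(m.group(1))  # /keyser/22300/
```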

,

import re

PATH_RE = re.compile(r'"GET (\S+) HTTP')  # adjust the pattern to your log format

def sorted_paths(filename):
    with open(filename) as f:
        gen = (PATH_RE.search(line) for line in f)
        s = set(match.group(1) for match in gen if match)
    return sorted(s)

,

This is a good way of doing it, in terms of both performance and conciseness.

,

Only if the order doesn’t matter (since sets are unordered), and if the types are hashable (which strings are).
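In this case the set's lack of ordering is harmless, because the final sorted() call returns a fresh alphabetically ordered list:

```python
# Duplicates collapse as items enter the set; sorted() imposes order at the end.
paths = set(["/b/x", "/a/y", "/a/y", "/c/z"])
print(sorted(paths))  # ['/a/y', '/b/x', '/c/z']
```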

,

One caveat: you should use full path names everywhere, and on Windows the same name can appear in various cases, since the filesystem is case-insensitive. Also, in Python you can use / instead of \ (and be careful about escaping backslashes).

If you are actually dealing with URLs, then most of the time domain.com, domain.com/, www.domain.com and http://www.domain.com mean the same thing, and you should decide how to normalize them.
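One possible normalization for both cases might look like this. It is only a sketch: `norm_url` is a hypothetical helper (not a standard-library function), and its rules (lower-case the host, strip a trailing slash) are just one reasonable choice:

```python
import os.path
from urllib.parse import urlparse

# Filesystem paths: collapse '..' segments and, on Windows, fold case.
p = os.path.normcase(os.path.normpath("logs/archive/../2007/access.log"))
print(p)  # 'logs/2007/access.log' on POSIX

# URLs: hypothetical normalizer -- lower-case the host, drop a trailing slash.
def norm_url(url):
    parts = urlparse(url if "://" in url else "http://" + url)
    return parts.netloc.lower() + parts.path.rstrip("/")

print(norm_url("http://WWW.domain.com/") == norm_url("www.domain.com"))  # True
```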

,

You can use a dictionary to store your paths:

from collections import defaultdict

h = defaultdict(str)
uniq = []
for line in open("file"):
    if "pattern" in line:
        # code to extract the path here.
        extractedpath = ......
        h[extractedpath.strip()] = ""   # using a dictionary to store unique values
        if extractedpath not in uniq:
            uniq.append(extractedpath)  # using a list to store unique values
