performance – Find and replace in CSV files with Python – Education Career Blog

Related to a previous question, I’m trying to do replacements over a number of large CSV files.

The column order (and contents) change between files, but for each file there are about 10 columns that I want and can identify by the column header names. I also have 1-2 dictionaries for each column I want. So for the columns I want, I want to use only the correct dictionaries and want to implement them sequentially.

An example of how I’ve tried to solve this:

# -*- coding: utf-8 -*-
import re

# imaginary csv file. pretend that we do not know the column order.
Header = u'col1', u'col2'
Line1 = u'A',u'X'
Line2 = u'B',u'Y'
fileLines = Line1,Line2

# dicts to translate lines
D1a = {u'A':u'a'}
D1b = {u'B':u'b'}
D2 = {u'X':u'x',u'Y':u'y'}

# dict to correspond header names with the correct dictionary.
# i would like the dictionaries to be read sequentially in col1.
refD = {u'col1':D1a,D1b,u'col2':D2}

# clunky replace function
def freplace(str, dict):
    rc = re.compile('|'.join(re.escape(k) for k in dict))
    def trans(m):
        return dictm.group(0)
    return rc.sub(trans, str)

# get correspondence between dictionary and column
C = 
for i in range(len(Header)):
    if Headeri in refD:
        C.append(refDHeaderi,i)

# loop through lines and make replacements
for line in fileLines:
    for i in range(len(line)):
        for j in range(len(C)):
            if Cj1 == i:
                for dict in Cj0:
                    linei = freplace(linei, dict)

My problem is that this code is quite slow, and I can’t figure out how to speed it up. I’m a beginner, and my guess was that my freplace function is largely what is slowing things down, because it has to compile for each column in each row. I would like to take the line rc = re.compile('|'.join(re.escape(k) for k in dict)) out of that function, but don’t know how to do that and still preserve what the rest of my code is doing.

,

There’s a ton of things that you can do to speed this up:

First, use the csv module. It provides efficient and bug-free methods for reading and writing CSV files. The DictReader object in particular is what you’re interested in: it will present every row it reads from the file as a dictionary keyed by its column name.

Second, compile your regexes once, not every time you use them. Save the compiled regexes in a dictionary keyed by the column that you’re going to apply them to.

Third, consider that if you apply a hundred regexes to a long string, you’re going to be scanning the string from start to finish a hundred times. That may not be the best approach to your problem; you might be better off investing some time in an approach that lets you read the string from start to end once.

,

You don’t need re:

# -*- coding: utf-8 -*-

# imaginary csv file. pretend that we do not know the column order.
Header = u'col1', u'col2'
Line1 = u'A',u'X'
Line2 = u'B',u'Y'
fileLines = Line1,Line2

# dicts to translate lines
D1a = {u'A':u'a'}
D1b = {u'B':u'b'}
D2 = {u'X':u'x',u'Y':u'y'}

# dict to correspond header names with the correct dictionary
refD = {u'col1':D1a,D1b,u'col2':D2}

# now let's have some fun...

for line in fileLines:
    for i, (param, word) in enumerate(zip(Header, line)):
        for minitranslator in refDparam:
            if word in minitranslator:
                linei = minitranslatorword

returns:

u'a', u'x', u'b', u'y'

,

So if that’s the case, and all 10 columns have the same names each time, but out of order, (I’m not sure if this is what you’re doing up there, but here goes) keep one array for the heading names, and one for each column split into elements (should be 10 items each line), now just offset which regex by doing a case/select combo, compare the element number of your header array, then inside the case, reference the data array at the same offset, since the name is what will get to the right case you should be able to use the same 10 regex’s repeatedly, and not have to recompile a new “command” each time.

I hope that makes sense. I’m sorry i don’t know the syntax to help you out, but I hope my idea is what you’re looking for
EDIT:
I.E.

initialize all regexes before starting your loops.

then after you read a line (and after the header line)

select arrayn

case “column1”

regex(data0);

case “column2”

regex(data1);
.
.
.
.
end select

This should call the right regex for the right columns

Leave a Comment