wikipedia

Support Wikipedia

Tuesday, July 21, 2009

Search and replace using regular expression groups

Regular expressions come in very handy for searching and replacing texts in data files. Most IDEs (eclipse on which this was tested) and text editors (TextPad, JEdit, etc.) support regular expressions. Take for example an input data file that you need to create/modify.
Here is the data file we are interested in...


07/10/09,11:10:20 AM,63463284,credit,salary electronic deposit,$1000.00
07/11/09,11:10:20 AM,63463284,credit,CD interest,$22.00
07/12/09,11:10:20 AM,63463284,credit,tax refund,$300.00
07/13/09,11:10:20 AM,63463284,credit,ATM deposit,$500.00
07/14/09,11:10:20 AM,63463284,debit,credit card payment,$137.00
07/15/09,11:10:20 AM,63463284,credit,checking a/c interest,$20.00
07/16/09,11:10:20 AM,63463284,debit,power bill,$80.00
07/17/09,11:10:20 AM,63463284,credit,Check #8774,$350.00
07/18/09,11:10:20 AM,63463284,debit,overdraft fee,$5.00
07/19/09,11:10:20 AM,63463284,credit,Check #167,$1000.00
07/10/09,11:10:20 AM,77667529,credit,salary electronic deposit,$2000.00
07/11/09,11:10:20 AM,77667529,credit,CD interest,$22.00
07/12/09,11:10:20 AM,77667529,credit,tax refund,$300.00
07/13/09,11:10:20 AM,77667529,credit,ATM deposit,$500.00
07/14/09,11:10:20 AM,77667529,debit,credit card payment,$137.00
07/15/09,11:10:20 AM,77667529,credit,checking a/c interest,$20.00
07/16/09,11:10:20 AM,77667529,debit,power bill,$80.00
07/17/09,11:10:20 AM,77667529,credit,Check #7766,$350.00
07/18/09,11:10:20 AM,77667529,debit,Check copy fee,$10.00
07/19/09,11:10:20 AM,77667529,credit,Check #125,$1000.00
07/10/09,11:10:20 AM,33445398,credit,salary electronic deposit,$2000.00
07/11/09,11:10:20 AM,33445398,credit,CD interest,$22.00
07/12/09,11:10:20 AM,33445398,credit,tax refund,$300.00
07/13/09,11:10:20 AM,33445398,credit,ATM deposit,$500.00
07/14/09,11:10:20 AM,33445398,debit,credit card payment,$137.00
07/15/09,11:10:20 AM,33445398,credit,checking a/c interest,$20.00
07/16/09,11:10:20 AM,33445398,debit,power bill,$80.00
07/17/09,11:10:20 AM,33445398,credit,Check #1255,$350.00
07/18/09,11:10:20 AM,33445398,debit,overdraft fee,$5.00
07/19/09,11:10:20 AM,33445398,credit,Check #433,$1000.00


The data is about bank transactions conducted on accounts.
Let's take a look at the first few lines of data...

Date Time account # type description currency amount
------- ------- --------------- ------- ------------------ -------------- ------------
07/10/09,11:10:20 AM,63463284,credit,salary electronic deposit,$1000.00
07/11/09,11:10:20 AM,63463284,credit,CD interest,$22.00

07/12/09,11:10:20 AM,63463284,credit,tax refund,$300.00

07/13/09,11:10:20 AM,63463284,credit,ATM deposit,$500.00

07/14/09,11:10:20 AM,63463284,debit,credit card payment,$137.00
07/15/09,11:10:20 AM,63463284,credit,checking a/c interest,$20.00

07/16/09,11:10:20 AM,63463284,debit,power bill,$80.00

07/17/09,11:10:20 AM,63463284,credit,Check #8774,$350.00

07/18/09,11:10:20 AM,63463284,debit,overdraft fee,$5.00

07/19/09,11:10:20 AM,63463284,credit,Check #167,$1000.00

..

..
..


Say we want to give credit to all overdraft fees charged for all accounts by adding an adjustment(credit) for the same amount. That is we want to add a new data line for every line which contains 'overdraft fee'. For example if you look at the lines of data above. For every line like this...
07/18/09,11:10:20 AM,63463284,debit,overdraft fee,$5.00
we want to add a new line like this....
07/18/09,11:10:20 AM,63463284,credit,overdraft fee adjustment,$5.00

The regular expression .*overdraft fee.*\n will get us the line containing overdraft fee. But we also need the timestamp at the beginning of the line. So let's refine the search string by adding a group.
^(.*)debit,overdraft fee,(.*)\n
So everything from the start of line upto debit forms group 1, which is the timestamp and account number.
The replacement string should be the entire last line which is $0 plus the same timestamp and account number($1) plus new text (credit,overdraft fee adjustment, $2) plus platform independent line feed (\R). Putting them all together the replacement string should be
$0$1credit,overdraft fee adjustment,$2\R
This was tested in eclipse IDE. In case of text editors like Textpad, the search string will be
^\\(.*\\)debit,overdraft fee,\\(.*\\)\n
and the replace string will be \0\1credit,overdraft fee adjustment,\2\n

Using the search and replace sub menu item of eclipse

No comments:

Post a Comment