Work flow automation with bash script?

mimosa · Post by **mimosa** » 4. Mar 2012, 09:53

Indeed, gapan is right, the script almost certainly needs debugging on a realistic sample.

No need for the Lynx stage if you use the Python script. It extracts well-formed email addresses irrespective of how they are formatted. However, it does make some assumptions that may cause it to fail with certain formats, which is why I suggested trying it out with all the formats Google Docs can download to. If it does fail, tweaking it will just be a matter of telling it to replace some more characters with whitespace. Or it might become quite robust and general if I rewrote it to check the pieces it throws away for further addresses. At the moment it will fail on something like this:

foo.bar@salix.com#$%&bar.foo@ubuntu.com

But to sum up, that is easily fixed, especially if you can post a sample.
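
To illustrate the more general idea of not throwing any pieces away, here is a minimal sketch that scans the whole text for address-shaped substrings instead of splitting on separators first. It is not the script under discussion: the name `find_addresses` and the pattern are assumptions made for illustration, it is written for Python 3, and the pattern only accepts alphanumerics and full stops.

```python
import re

# Assumption: an address is runs of letters/digits joined by single full stops,
# then "@", then a domain containing at least one full stop.
# Real addresses may contain other characters; this pattern ignores them.
ADDRESS = re.compile(
    r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*@[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)+"
)

def find_addresses(text):
    """Return every address-shaped substring of text, in order of appearance."""
    return ADDRESS.findall(text)

print(find_addresses("foo.bar@salix.com#$%&bar.foo@ubuntu.com"))
# → ['foo.bar@salix.com', 'bar.foo@ubuntu.com']
```

Because the scan resumes right after each match, junk characters between two addresses simply get skipped.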

globetrotterdk · Post by **globetrotterdk** » 4. Mar 2012, 10:08

gapan wrote: It would help a lot if you posted part of that dump. You can edit the contact details before posting, so that the real ones don't get published here.

I thought about that as well. I tried opening it in Kate, but Kate chokes on it. Geany just gives me the info I posted. The same goes for Leafpad and nano. When I open it in Vim, I only get a bunch of HTML-style code. The only way I have found to view the contents is by opening the file in Firefox. Very weird. In Lynx, it looks like this:

Code:

A2 ny 7 Apple Pear C.Th. Zzzz St. 4, 3.th 2300 Kbh. S         
35362026 abc@humanrights.dk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 193      
                                                                                 .                                                                          
A2 0 7 Apple Orange 26 Diddely 4690 Haslev 5639 9050        
5578 8888 abc@nhs.dk

They both appear to be recognized by Lynx as e-mail addresses. When I search in Lynx for an e-mail address that I know isn't formatted properly, I get "unknown or ambiguous command" as a response.

globetrotterdk · Post by **globetrotterdk** » 4. Mar 2012, 10:35

gapan wrote: It would help a lot if you posted part of that dump. You can edit the contact details before posting, so that the real ones don't get published here.

There isn't anything in the .xls file, just normal cells in a spreadsheet, in this case with some e-mail addresses formatted as "mailto:" hyperlinks and others not formatted. The picture is more mixed with the .csv file. Here is what an address looks like that isn't formatted as a "mailto:" hyperlink:

Code:

,abc@politik.dk,,

I believe that the two following addresses are both formatted as "mailto:" hyperlinks:

Code:

,,,,,,,,,abc@mail.dk

,,,,,,abc.def@jura.dk,professor,,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,242

I am not sure why they look different. Note: the abc@mail.dk address doesn't have any trailing commas.

gapan · Post by **gapan** » 4. Mar 2012, 10:48

Well, here's a quick and dirty sed sequence that cleans up addresses on that sample csv.

Code:

cat file.csv |grep "@" | \
sed "s/.*[,^]\(.*\)@\(.*\)/\1@\2/" |sed "s/,/__FOO__/"| \
sed "s/\(.*\)__FOO__.*/\1/"

I'm assuming that the "__FOO__" string is nowhere in your file.

edit: I've added a grep in there, so it only keeps lines with addresses.
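
For comparison, the same clean-up can be sketched with Python's standard `csv` module, which splits each line into cells so no placeholder string is needed. This is only an illustration under assumptions: `addresses_from_csv` is a hypothetical name, the code targets Python 3, and any cell containing an "@" is treated as an address.

```python
import csv
import io

def addresses_from_csv(lines):
    """Collect every cell that contains an "@" from an iterable of CSV lines."""
    found = []
    for row in csv.reader(lines):
        for cell in row:
            cell = cell.strip()
            if "@" in cell:          # assumption: "@" marks an address cell
                found.append(cell)
    return found

# Sample lines modelled on the ones posted in this thread.
sample = io.StringIO(
    ",abc@politik.dk,,\n"
    ",,,,,,,,,abc@mail.dk\n"
    ",,,,,,abc.def@jura.dk,professor,,0,0,1,242\n"
)
print(addresses_from_csv(sample))
# → ['abc@politik.dk', 'abc@mail.dk', 'abc.def@jura.dk']
```

Unlike a per-line substitution, this also picks up several addresses that sit on the same line.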

mimosa · Post by **mimosa** » 4. Mar 2012, 11:56

sed is more succinct than Python

... but it is harder to read

globetrotterdk · Post by **globetrotterdk** » 4. Mar 2012, 12:07

gapan wrote: I've added a grep in there, so it only keeps lines with addresses.

Cheers, that seems to work. I then went into Vim and ran the following to format the e-mails so that I can just copy-paste them into a BCC: line.

Code:

%s/\n/, /g
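
The same joining step can also be done outside the editor. A minimal sketch, assuming one address per line; `join_for_bcc` is a hypothetical helper name. Joining the lines, rather than substituting every line ending, avoids a stray separator after the last address.

```python
def join_for_bcc(text):
    """Join the non-empty lines of text with ", " for pasting into a BCC field."""
    lines = [line.strip() for line in text.splitlines()]
    return ", ".join(line for line in lines if line)

print(join_for_bcc("abc@mail.dk\nabc.def@jura.dk\n"))
# → abc@mail.dk, abc.def@jura.dk
```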

globetrotterdk · Post by **globetrotterdk** » 4. Mar 2012, 12:10

mimosa wrote: sed is more succinct than Python

... but it is harder to read

I have to agree with both of you - and as I have yet to learn either one...

I am pretty much at your mercy

mimosa · Post by **mimosa** » 4. Mar 2012, 12:17

I have added a line to turn commas into whitespace:

http://pastebin.com/tDG6XgAA

All you should now need to do is download the data from Google to somefile.csv, execute

Code:

$ bcc.py somefile.csv

and the file bcc.txt should contain a list of addresses separated by commas, ready to paste into your email bcc field.

The script assumes:

1) email addresses contain only alphanumeric characters and full stops with a "@" somewhere in the middle (technically, you can put all sorts of strange stuff in an email address, but I've never seen one that did)
2) they are separated in the input file by spaces, newlines, or commas

It might be an idea to add tabs to the items in 2). But let me know if it works like this!
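
The two assumptions can be sketched in a few lines. This is not the pastebin script itself, only an illustration of the approach as described: `extract`, `SEPARATORS` and `TOKEN` are made-up names, tabs are included among the separators, and the code is written for Python 3.

```python
import re

SEPARATORS = re.compile(r"[,\t]")                    # commas and tabs become whitespace
TOKEN = re.compile(r"[A-Za-z0-9.]+@[A-Za-z0-9.]+")   # assumption 1: alphanumerics, full stops, one "@"

def extract(text):
    """Turn separators into spaces, split on whitespace, keep address-like tokens."""
    tokens = SEPARATORS.sub(" ", text).split()
    # fullmatch rejects any token containing a character outside the pattern
    return [t for t in tokens if TOKEN.fullmatch(t)]

raw = ",,,,,,,,,abc@mail.dk\n\n,,,,,,abc.def@jura.dk,professor,,0,0,1,242\n"
print(", ".join(extract(raw)))
# → abc@mail.dk, abc.def@jura.dk
```

A token made of two addresses glued together by other characters fails the full match and is dropped whole, which is the weakness mentioned earlier in the thread.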

mimosa · Post by **mimosa** » 4. Mar 2012, 12:27

Code:

vanilla[bin]$ cat raw.txt
,,,,,,,,,abc@mail.dk

,,,,,,abc.def@jura.dk,professor,,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,242MAILTO; 

vanilla[bin]$ bcc.py raw.txt
vanilla[bin]$ cat bcc.txt
abc@mail.dk, abc.def@jura.dk

globetrotterdk · Post by **globetrotterdk** » 4. Mar 2012, 12:30

Code:

$ python bcc.py some_file.csv > 1some_file.csv
Traceback (most recent call last):
  File "bcc.py", line 88, in <module>
    main()
  File "bcc.py", line 23, in main
    address = wellFormed(address)      #strip it of forbidden elements
  File "bcc.py", line 44, in wellFormed
    user, domain = possAddress.split("@")  #divide into user and domain
ValueError: too many values to unpack
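
The last line of the traceback is what Python raises when a two-name unpacking receives more than two values, which `split("@")` produces for any token containing more than one "@". Below is a minimal sketch of one way to guard against that. `split_address` is a hypothetical stand-in, not the `wellFormed` function from bcc.py, whose body is not shown in this thread.

```python
def split_address(token):
    """Return (user, domain) if token contains exactly one "@", else None."""
    if token.count("@") != 1:
        return None                  # zero or several "@": not a single address
    user, domain = token.split("@")  # safe now: exactly two parts
    if not user or not domain:
        return None                  # "@" at the very start or end
    return user, domain

print(split_address("abc@mail.dk"))          # → ('abc', 'mail.dk')
print(split_address("abc@mail.dk@jura.dk"))  # → None
```

Returning `None` lets the caller skip the token, or split it further, instead of crashing.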
