
Re: Work flow automation with bash script?

Posted: 27. Feb 2012, 09:56
by globetrotterdk
Shador wrote: Actually you can merge the two steps like this:

Code:

lynx -dump -force-html some_document.html | sed -n 's/^ *[0-9]*\. //p' | fgrep "mailto:" | sed -e 's/\n/, /g' > some_document.txt
That doesn't work for me. The file gets created, but this part seems to have no effect:

Code:

sed -e 's/\n/, /g'
The output still has the "mailto:" prefix, with the e-mail addresses in a column:
mailto:abc@humanrights.dk
mailto:def@bees.com
mailto:ghi@mail.dk

Re: Work flow automation with bash script?

Posted: 27. Feb 2012, 10:02
by gapan
sed won't work like that. It works on a line-by-line basis, so it never actually parses newline characters. You can use tr instead:

Code:

tr "\n" ","
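To see the difference on a small sample (the three addresses are the placeholders from the post above): sed's substitution finds nothing to replace, because each line is handed to it without its newline. tr translates single characters, so it can turn every newline into a comma, but it cannot insert an extra space, and it leaves a trailing comma where the final newline was. `paste -s`, which nobody has mentioned in this thread, is one alternative that avoids the trailing comma. A rough sketch:

```shell
#!/bin/sh
# Sample data only: the placeholder addresses from earlier in the thread.
sample() {
    printf '%s\n' abc@humanrights.dk def@bees.com ghi@mail.dk
}

# sed sees one line at a time, without its newline, so nothing matches
# and the three lines come out unchanged.
sample | sed -e 's/\n/, /g'

# tr maps single characters: every newline becomes a comma,
# including the last one, and no space can be added.
sample | tr '\n' ','
echo

# paste -s joins the lines with the delimiter and keeps one final newline.
sample | paste -sd, -
```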

Re: Work flow automation with bash script?

Posted: 27. Feb 2012, 10:46
by Shador
gapan wrote: sed won't work like that. It works on a line-by-line basis, so it never actually parses newline characters. You can use tr instead:

Code:

tr "\n" ","
Yes, you're right; I didn't look closely enough. With sed it would be:

Code:

sed -e ':a;N;$!ba;s/\n/, /g'
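For anyone wondering what that one-liner does, here is the same command spelled out with comments and run on sample input. It assumes GNU sed (the semicolon-separated form with a label is not portable to every sed), `join_lines` is just a made-up name, and the addresses are the placeholders from earlier in the thread.

```shell
#!/bin/sh
# join_lines is a hypothetical name; the sed program is the one from the post.
#   :a          define a label named "a"
#   N           append the next input line to the pattern space,
#               separated from what is already there by a newline
#   $!ba        if this is not the last line, branch back to "a"
#   s/\n/, /g   once all lines are collected, replace each newline with ", "
join_lines() {
    sed -e ':a;N;$!ba;s/\n/, /g'
}

printf '%s\n' abc@humanrights.dk def@bees.com ghi@mail.dk | join_lines
```

The point of the loop is that the substitution only runs after the whole input has been gathered into the pattern space, which is why the newlines are now there to be matched.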

Re: Work flow automation with bash script?

Posted: 27. Feb 2012, 12:43
by mimosa
Stripping the "mailto:" prefix doesn't seem to be working either.

It would be quite easy to write a small helper script that extracts well-formed email addresses from any file format reasonably close to plain text and then joins them with commas. This would be more robust, and it could accommodate changes in the organisation's workflow (for example, if they stop using Google Docs).
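As a rough illustration of that idea in shell, something along these lines would pull anything that looks like an email address out of a text-like file and join the results with commas. The pattern is a deliberately loose approximation of an address, `extract_emails` is a made-up name, the sample input is invented, and `grep -o` is assumed to be available (it is in GNU grep).

```shell
#!/bin/sh
# extract_emails: hypothetical helper, reads text on stdin.
# The regular expression is a loose approximation, not a full address validator.
extract_emails() {
    grep -Eo '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' |
        sort -u |          # drop duplicates (output ends up sorted)
        paste -sd, - |     # join into one comma-separated line
        sed 's/,/, /g'     # add a space after each comma
}

# Sample input invented for the demonstration.
printf '%s\n' \
    '1. mailto:abc@humanrights.dk' \
    '<td>def@bees.com</td>' \
    'Contact: "ghi@mail.dk", or abc@humanrights.dk again' |
    extract_emails
```

Because it only looks for the address itself, it does not matter whether an address is wrapped in a "mailto:" link, a table cell, or quotation marks.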

Re: Work flow automation with bash script?

Posted: 27. Feb 2012, 14:25
by gapan
To remove the "mailto:" part, you can run another sed, just after the fgrep "mailto:".

Code:

sed "s/^mailto://"
The ^ might not make any difference, but it's not doing any harm either.
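Putting the pieces from this thread together, the text-processing part of the pipeline might look roughly like this. It is only a sketch: `clean_addresses` is a made-up name, the sample input imitates the numbered reference list that `lynx -dump` prints (an assumption about what the real dump contains), `grep -F` stands in for `fgrep`, and the line-joining step assumes GNU sed.

```shell
#!/bin/sh
# clean_addresses: hypothetical helper, reads lynx -dump output on stdin.
clean_addresses() {
    sed -n 's/^ *[0-9]*\. //p' |        # keep numbered reference lines, drop the number
        grep -F 'mailto:' |             # keep only the mailto: links
        sed 's/^mailto://' |            # strip the mailto: prefix
        sed -e ':a;N;$!ba;s/\n/, /g'    # join the lines with ", " (GNU sed)
}

# Invented sample in the shape of a lynx reference list.
printf '%s\n' \
    'References' \
    '' \
    '   1. mailto:abc@humanrights.dk' \
    '   2. http://www.example.com/' \
    '   3. mailto:def@bees.com' |
    clean_addresses
```

With the real data, the function would sit between `lynx -dump -force-html some_document.html` and the redirect to `some_document.txt`.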

Re: Work flow automation with bash script?

Posted: 28. Feb 2012, 00:48
by mimosa
I expect you've solved your problem by now :)

Just for fun, though, here's a Python script I've written to strip the padding from around the email addresses and join them back together. From the command line, you would run:
./bcc.py some_document.some_format
You should then find a file bcc.txt with the cleaned-up emails in the directory you called the script from. In this case, your shell script would go something like:

Code:

#! /bin/sh
cd /home/globetrotter/path/to/directory
google docs get ... [whatever] some.document
bcc.py some.document
I haven't tested it much because I don't have a convenient sample. I should also stress that this probably isn't the best Python style, as I'm just starting out with Python. If you want to try it out, put it somewhere in your $PATH (such as /usr/local/bin) and make it executable. :)

http://pastebin.com/kvHSg3DQ

Re: Work flow automation with bash script?

Posted: 3. Mar 2012, 23:35
by globetrotterdk
Thanks for the postings. I have to admit that this is a bit over my head. I have been trying to do some research on the issue, but to make matters worse, I have found out that the woman in charge of maintaining the spreadsheet with the member data can't figure out how to take an e-mail address in a cell and convert it to a "mailto:" hyperlink in Google Docs. I have sent her the necessary documentation and explained how to do it in practice, but it hasn't helped:
http://support.google.com/docs/bin/answ ... swer=44660

Code:

=hyperlink("ab@jura.dk")
This means that I have the immediate problem of extracting those e-mail addresses that aren't formatted as "mailto:" hyperlinks from the lynx HTML dump. I extracted what I thought were all of the e-mail addresses, only to find out afterwards that half of them aren't formatted as "mailto:" hyperlinks. This has to be a quick and dirty solution due to time constraints: I still have to send out the remaining invitations to the annual general conference. Any ideas?

Re: Work flow automation with bash script?

Posted: 4. Mar 2012, 00:07
by mimosa
Did you try my Python script? It's designed to be quite general in that it picks out anything that looks like an email address from surrounding material and then throws the latter away, so it's not limited to the problem as you originally described it. Depending on what that material is, a little tweaking might be needed, or maybe another of the formats the console tool allows you to download in will work better.

Am I right in thinking that all the addresses are currently surrounded by quotation marks?

Re: Work flow automation with bash script?

Posted: 4. Mar 2012, 08:00
by globetrotterdk
mimosa wrote: Did you try my Python script? It's designed to be quite general in that it picks out anything that looks like an email address from surrounding material and then throws the latter away, so it's not limited to the problem as you originally described it. Depending on what that material is, a little tweaking might be needed, or maybe another of the formats the console tool allows you to download in will work better.

Am I right in thinking that all the addresses are currently surrounded by quotation marks?
Hi mimosa. I haven't tried your python script yet. I was unsure about two things:
1) Whether the script works on the lynx dump file.
2) How the script determines where the e-mail addresses are in the file.

I am unsure as to how the e-mail addresses are surrounded. When I open the "dump" file in Firefox, everything seems to be in tables. I have tried opening the file in text editors (Nano, Geany, Vim), but all I get are lines that start like this:

Code:

<!DOCTYPE html>
<html><head><title>database</title>

Re: Work flow automation with bash script?

Posted: 4. Mar 2012, 09:10
by gapan
It would help a lot if you posted part of that dump. You can edit the contact details before posting, so that the real ones don't get published here.
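One quick way to do that, as a rough sketch: replace everything that looks like an email address with a placeholder before pasting. The pattern is a loose approximation, `mask_emails` and `user@example.com` are made up for the example, and names or phone numbers in the file would still need editing by hand.

```shell
#!/bin/sh
# mask_emails: hypothetical helper, reads text on stdin and writes the masked text.
mask_emails() {
    sed -E 's/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/user@example.com/g'
}

# Invented sample line; with the real file it would be something like:
#   mask_emails < some_document.html
printf '%s\n' '<td><a href="mailto:abc@humanrights.dk">abc@humanrights.dk</a></td>' | mask_emails
```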