Work flow automation with bash script?

globetrotterdk
Posts: 435
Joined: 26. Oct 2010, 13:57
Location: Denmark

Re: Work flow automation with bash script?

Post by globetrotterdk »

Shador wrote:Actually you can merge the two steps like this:

Code: Select all

lynx -dump -force-html some_document.html | sed -n 's/^ *[0-9]*\. //p' | fgrep "mailto:" | sed -e 's/\n/, /g' > some_document.txt
That doesn't work for me. The file gets created, but this part has no effect for some reason:

Code: Select all

sed -e 's/\n/, /g'
The output still has the "mailto:" prefix, with the e-mail addresses one per line:
mailto:abc@humanrights.dk
mailto:def@bees.com
mailto:ghi@mail.dk
Military justice is to justice what military music is to music. - Groucho Marx
gapan
Salix Wizard
Posts: 6368
Joined: 6. Jun 2009, 17:40

Re: Work flow automation with bash script?

Post by gapan »

sed won't work like that. It works on a line-by-line basis, so it never actually parses newline characters. You can use tr instead:

Code: Select all

tr "\n" ", "
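One caveat with tr: it translates character by character, so with a two-character replacement only the comma is used, and the final newline becomes a trailing comma as well. A minimal sketch of the difference, using made-up addresses and paste as one possible alternative (assuming a POSIX paste and sed are available):

```shell
# Made-up sample standing in for the output of the fgrep step.
addresses='mailto:abc@example.org
mailto:def@example.org
mailto:ghi@example.org'

# tr maps characters one-to-one: "\n" becomes ",", the space in ", " is
# never used, and the final newline turns into a trailing comma.
with_tr=$(printf '%s\n' "$addresses" | tr '\n' ', ')

# paste -s joins all input lines with the given delimiter (no trailing
# delimiter); sed then adds a space after each comma.
with_paste=$(printf '%s\n' "$addresses" | paste -sd, - | sed 's/,/, /g')

printf '%s\n' "$with_tr" "$with_paste"
```

The first line printed ends in a stray comma with no spaces; the second has ", " between the addresses and nothing at the end.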
Shador
Posts: 1295
Joined: 11. Jun 2009, 14:04
Location: Bavaria

Re: Work flow automation with bash script?

Post by Shador »

gapan wrote:sed won't work like that. It works on a line-by-line basis, so it never actually parses newline characters. You can use tr instead:

Code: Select all

tr "\n" ", "
Yes, you're right. Didn't look closely enough. For sed it is:

Code: Select all

 sed -e ':a;N;$!ba;s/\n/, /g'
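For anyone puzzled by that one-liner, here is a rough annotated sketch of what it does. The sample addresses are made up, and the commands are split into separate -e expressions, which should behave the same way in GNU sed and is a little more portable to other sed implementations:

```shell
# :a          define a label named "a"
# N           append the next input line to the pattern space
# $!ba        if this is not the last line, branch back to "a"
# s/\n/, /g   the whole input is now in one buffer, so the embedded
#             newlines can finally be replaced
joined=$(printf 'abc@example.org\ndef@example.org\nghi@example.org\n' |
  sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/, /g')

printf '%s\n' "$joined"
```

Because the loop reads the entire input into memory before substituting, this is fine for a membership list but not ideal for very large files.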
mimosa
Salix Warrior
Posts: 3311
Joined: 25. May 2010, 17:02
Contact:

Re: Work flow automation with bash script?

Post by mimosa »

Stripping the "mailto:" prefix also doesn't seem to be working.

It would be quite easy to write a small sub-script that extracted well-formed email addresses from any file format reasonably close to text, and then glued them together with commas. This would be more robust and capable of accommodating changes in the organisation's workflow (such as if they stop using Google docs).
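A rough sketch of that idea in shell, not a tested solution. It assumes a grep that supports -o (GNU and BSD grep do, but it is not in POSIX), and the pattern is a deliberately loose guess at what an address looks like, so it will miss unusual ones. The function name and the sample lines are made up for illustration:

```shell
# Hypothetical helper: reads any text on stdin and prints every string
# that looks like an email address, de-duplicated, as one
# comma-separated line.
extract_emails() {
  grep -oE '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' |
    sort -u |            # drop duplicates (link target plus link text)
    paste -sd, - |       # join the lines with commas
    sed 's/,/, /g'       # add a space after each comma
}

# Made-up sample: one address as a mailto: link, one as plain text.
sample='<td><a href="mailto:abc@example.org">abc@example.org</a></td>
<td>def@example.org</td>
<td>no address here</td>'

emails=$(printf '%s\n' "$sample" | extract_emails)
printf '%s\n' "$emails"
```

Since it only looks for address-shaped strings, it does not matter whether an address is a "mailto:" hyperlink or plain cell text, or whether the input is raw HTML or a lynx dump.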
gapan
Salix Wizard
Posts: 6368
Joined: 6. Jun 2009, 17:40

Re: Work flow automation with bash script?

Post by gapan »

To remove the "mailto:" part, you can run another sed, just after the fgrep "mailto:".

Code: Select all

sed "s/^mailto://"
The ^ might not make any difference, but it's not doing any harm either.
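Putting the pieces from this thread together, the whole pipeline might look roughly like the sketch below. The function name is made up, the sample only imitates the numbered reference list at the end of a lynx dump, and paste is used for the joining step as one option among several:

```shell
# Hypothetical helper: reads `lynx -dump` output on stdin and prints the
# mailto: addresses as one comma-separated line.
join_mailto() {
  sed -n 's/^ *[0-9]*\. //p' |   # keep only the numbered reference lines
    grep -F 'mailto:' |          # grep -F is the same as fgrep
    sed 's/^mailto://' |         # strip the "mailto:" prefix
    paste -sd, - |               # join the lines with commas
    sed 's/,/, /g'               # add a space after each comma
}

# Made-up sample resembling the reference list of a lynx dump.
sample='References

   1. http://example.org/
   2. mailto:abc@example.org
   3. mailto:def@example.org'

result=$(printf '%s\n' "$sample" | join_mailto)
printf '%s\n' "$result"
```

With the real file, the call would be something like `lynx -dump -force-html some_document.html | join_mailto > some_document.txt`.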
mimosa
Salix Warrior
Posts: 3311
Joined: 25. May 2010, 17:02
Contact:

Re: Work flow automation with bash script?

Post by mimosa »

I expect you've solved your problem by now :)

Just for fun, though, here's a Python script I've written to strip the padding from around the email addresses and join them back together again. From the command line, you would do:
./bcc.py some_document.some_format
You should find a file bcc.txt with the cleaned up emails in the directory you called the script from. In this case, your shell script would go something like:

Code: Select all

#! /bin/sh
cd /home/globetrotter/path/to/directory
google docs get ... [whatever] some.document
bcc.py some.document
I haven't tested it much because I don't have a convenient sample. I should also stress that this probably isn't the best Python style, as I'm just starting out with Python. If you want to try it out, put it somewhere in your $PATH (such as /usr/local/bin) and make it executable. :)

http://pastebin.com/kvHSg3DQ
globetrotterdk
Posts: 435
Joined: 26. Oct 2010, 13:57
Location: Denmark

Re: Work flow automation with bash script?

Post by globetrotterdk »

Thanks for the postings. I have to admit that this is a bit over my head. I have been trying to do some research on the issue, but to make matters worse, I have found out that the woman in charge of maintaining the spreadsheet with the member data can't figure out how to take an e-mail address in a cell and convert it to a "mailto:" hyperlink in Google Docs. I have sent her the necessary documentation and explained how to do it in practice, but it hasn't helped:
http://support.google.com/docs/bin/answ ... swer=44660

Code: Select all

=hyperlink("ab@jura.dk")
This means that I have the immediate problem of extracting the e-mail addresses that aren't formatted as "mailto:" hyperlinks from the lynx HTML dump. I extracted what I thought were all of the e-mail addresses, only to find out afterwards that half of them aren't formatted as "mailto:" hyperlinks. This has to be a quick and dirty solution due to time constraints: I still have to send out the remaining invitations to the annual general conference. Any ideas?
mimosa
Salix Warrior
Posts: 3311
Joined: 25. May 2010, 17:02
Contact:

Re: Work flow automation with bash script?

Post by mimosa »

Did you try my Python script? It's designed to be quite general in that it picks out anything that looks like an email address from surrounding material and then throws the latter away, so it's not limited to the problem as you originally described it. Depending on what that material is, a little tweaking might be needed, or maybe another of the formats the console tool allows you to download in will work better.

Am I right in thinking that all the addresses are currently surrounded by quotation marks?
globetrotterdk
Posts: 435
Joined: 26. Oct 2010, 13:57
Location: Denmark

Re: Work flow automation with bash script?

Post by globetrotterdk »

mimosa wrote:Did you try my Python script? It's designed to be quite general in that it picks out anything that looks like an email address from surrounding material and then throws the latter away, so it's not limited to the problem as you originally described it. Depending on what that material is, a little tweaking might be needed, or maybe another of the formats the console tool allows you to download in will work better.

Am I right in thinking that all the addresses are currently surrounded by quotation marks?
Hi mimosa. I haven't tried your Python script yet. I was unsure about two things:
1) Whether the script works on the lynx dump file.
2) How the script determines where the e-mail addresses are in the file.

I am unsure how the e-mail addresses are surrounded. When I open the "dump" file in Firefox, everything seems to be in tables. I have tried opening the file in other editors (Nano, Geany, Vim), but all I get are lines that start like this:

Code: Select all

<!DOCTYPE html>
<html><head><title>database</title>
gapan
Salix Wizard
Posts: 6368
Joined: 6. Jun 2009, 17:40

Re: Work flow automation with bash script?

Post by gapan »

It would help a lot if you posted part of that dump. You can edit the contact details before posting, so that the real ones don't get published here.