From: la gleki
To: lojban@googlegroups.com
Cc: evarismb@gmail.com
Date: Sat, 2 Mar 2013 00:37:52 -0800 (PST)
Subject: [lojban] Re: How to export tatoeba in simple format

On Wednesday, January 2, 2013 11:00:34 PM UTC+4, evar...@gmail.com wrote:
Hi,
I'm a German teacher at a Spanish university and I've tried to adapt your script to download a bilingual CSV (German-Spanish) from Tatoeba. The problem is that I have absolutely no programming or Linux knowledge and I can't figure out why this doesn't work. It would be very nice if you could give me a hint on how to do that. Thank you!

I suggest that you replace the sequence "jbo" with the sequence "deu" in all files of the script (the list of all language codes can be seen here).
Also open all the files of the script and replace "jbo" with "deu" there.

Then add the downloaded database to the folder and run the script (on Windows you can use Cygwin).
Note that the script is rather slow. It might take several hours to complete.
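The search-and-replace step can also be done in one go. This is only a minimal sketch: it assumes the script files end in `.sh` (as `prepare-links.sh` and `make-pairs.sh` do), and the function name `replace_code_in_scripts` is made up for illustration. It replaces the text blindly, so any unrelated "jbo" inside a file would change too.

```python
from pathlib import Path


def replace_code_in_scripts(folder, old="jbo", new="deu"):
    """Replace every occurrence of `old` with `new` in the *.sh files of `folder`.

    Returns the names of the files that were actually changed.
    Assumption: the scripts are UTF-8 text files with a .sh extension.
    """
    changed = []
    for path in sorted(Path(folder).glob("*.sh")):
        text = path.read_text(encoding="utf-8")
        if old in text:
            path.write_text(text.replace(old, new), encoding="utf-8")
            changed.append(path.name)
    return changed
```

Calling `replace_code_in_scripts(".")` from inside the unpacked folder would rewrite the scripts in place, so keep a copy of the originals.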


On Wednesday, March 21, 2012 17:52:12 UTC+1, ianek wrote:
OK, I've made it. http://dl.dropbox.com/u/17805197/parse-tatoeba.tar.gz
Unpack it to a directory with links.csv and sentences.csv from Tatoeba.
Run ./prepare-links.sh once. (You'll have to do it again only if you replace links/sentences with newer files.)
Then run ./make-pairs.sh [language-code] > [some filename].csv
For example ./make-pairs.sh eng > jbo-eng.csv

I've made it so that it gathers all of the interlinked sentences. This has some drawbacks. Do you know the "phone game"? If you do, you know what I'm saying. If you don't, you will know when you look at some pairs...

mu'o mi'e ianek
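To illustrate the "phone game" effect, here is a rough sketch of gathering all interlinked sentences with a breadth-first walk over the links. It is not the code from the tarball; the function name `interlinked` and the sentence ids are invented for the example.

```python
from collections import defaultdict, deque


def interlinked(start_id, links):
    """Return every sentence id reachable from start_id through translation links."""
    graph = defaultdict(set)
    for a, b in links:
        # Treat each link as going both ways.
        graph[a].add(b)
        graph[b].add(a)
    seen = {start_id}
    queue = deque([start_id])
    while queue:
        current = queue.popleft()
        for neighbour in graph[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    seen.discard(start_id)
    return seen


# Hypothetical chain: 1 is translated as 2, 2 as 3, 3 as 4.
links = [(1, 2), (2, 3), (3, 4)]
print(sorted(interlinked(1, links)))
```

Sentence 4 ends up paired with sentence 1 although it is three translations away, and each step may have shifted the meaning a little. That is where the odd pairs come from.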

On Wednesday, March 7, 2012 7:36:44 PM UTC+1, ianek wrote:
http://dl.dropbox.com/u/17805197/jbo-rus.csv

But it's probably not complete, for the reason I mentioned.

On 7 Mar, 19:32, ianek <jane...@gmail.com> wrote:
> I've just found out that links.csv is not complete, i.e. it doesn't
> cover all the pairs. For example, we have a Lojban sentence "lo purci
> ka'e te djuno gi'e na ka'e se galfi .i lo balvi ka'e se galfi gi'e na
> ka'e te djuno" and a Polish sentence "Przeszłość może być tylko
> poznana, nie zmieniona. Przyszłość może być tylko zmieniona, nie
> poznana." and they're not linked to each other, but they both are
> linked to "The past can only be known, not changed. The future can
> only be changed, not known.". I wonder if there's a rule that such
> sentences always have a "common relative"; it would certainly make
> things easier. But I think that now using a database (maybe sqlite3)
> would be necessary.
>
> mu'o mi'e ianek
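A sketch of the sqlite3 idea, pairing two sentences through a "common relative" with a two-step join. The table and column names, the sentence ids, and the shortened sample texts are all assumptions made for the example, as is the premise that links are stored in both directions.

```python
import sqlite3

# Shortened versions of the three sentences quoted above; the ids are made up.
JBO = "lo purci ka'e te djuno gi'e na ka'e se galfi"
ENG = "The past can only be known, not changed."
POL = "Przeszłość może być tylko poznana, nie zmieniona."

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sentences (id INTEGER PRIMARY KEY, lang TEXT, text TEXT);
    CREATE TABLE links (sentence_id INTEGER, translation_id INTEGER);
""")
con.executemany("INSERT INTO sentences VALUES (?, ?, ?)",
                [(1, "jbo", JBO), (2, "eng", ENG), (3, "pol", POL)])
# Lojban and Polish are each linked to English only, never to each other.
con.executemany("INSERT INTO links VALUES (?, ?)",
                [(1, 2), (2, 1), (3, 2), (2, 3)])


def pairs_via_common_relative(con, lang_a, lang_b):
    """Pair sentences of lang_a and lang_b that share a linked sentence."""
    query = """
        SELECT DISTINCT a.text, b.text
        FROM sentences AS a
        JOIN links AS la ON la.sentence_id = a.id
        JOIN links AS lb ON lb.sentence_id = la.translation_id
        JOIN sentences AS b ON b.id = lb.translation_id
        WHERE a.lang = ? AND b.lang = ?
        ORDER BY a.text, b.text
    """
    return con.execute(query, (lang_a, lang_b)).fetchall()


print(pairs_via_common_relative(con, "jbo", "pol"))
```

This only finds pairs exactly two links apart. Direct pairs would need a separate one-step query, and longer chains would need a walk over the links.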
>
> On 7 Mar, 15:51, ianek <jane...@gmail.com> wrote:
>
> > What platform? Is Linux ok?
>
> > > On 7 Mar, 11:44, gleki <gleki.is.my.n...@gmail.com> wrote:
>
> > > I'm interested, and actually in doing it periodically myself rather than by
> > > request,
> > > because the database is live and is being updated by us.
>
> > > Of course I know about those three files.
>
> > > For now, I'd prefer such an export for several directions at once (a
> > > multilingual spreadsheet).
> > > I want all sentences for which we have Lojban translations,
> > > i.e.
> > > first column: Lojban
> > > second column: English
> > > then I need
> > > japanese
> > > chinese
> > > russian
> > > arabic
> > > spanish
> > > polish
> > > french
> > > german
>
> > > I'll repeat once again: an automated script for doing so would be awesome.
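The multilingual layout described above (Lojban first, then one column per language) could be sketched roughly like this. The data structures, the function name `multilingual_rows`, and the sample ids are assumptions for illustration; only direct links from the Lojban sentence are used.

```python
from collections import defaultdict


def multilingual_rows(sentences, links, pivot="jbo", others=("eng", "rus")):
    """Build one row per pivot-language sentence, one column per other language.

    sentences: {id: (lang, text)}
    links:     iterable of (sentence_id, translation_id)
    Several translations into the same language are joined with " / ".
    """
    translations = defaultdict(lambda: defaultdict(list))
    for a, b in links:
        if a in sentences and b in sentences and sentences[a][0] == pivot:
            lang, text = sentences[b]
            translations[a][lang].append(text)
    rows = []
    pivot_ids = sorted(i for i, (lang, _) in sentences.items() if lang == pivot)
    for sentence_id in pivot_ids:
        row = [sentences[sentence_id][1]]
        for lang in others:
            # A missing translation becomes an empty cell.
            row.append(" / ".join(translations[sentence_id][lang]))
        rows.append(row)
    return rows


# Made-up ids for a tiny example.
sentences = {1: ("jbo", "coi"), 2: ("eng", "Hello"),
             3: ("rus", "Привет"), 4: ("eng", "Hi")}
links = [(1, 2), (1, 3), (1, 4)]
print(multilingual_rows(sentences, links))
```

Extending `others` with more language codes adds the remaining columns from the list.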
>
> > > On Wednesday, March 7, 2012 2:47:17 AM UTC+4, ianek wrote:
>
> > > > I've created the list for you, but it was an ugly hack in bash. A
> > > > better way would be to create a database and import sentences.csv and
> > > > links.csv into it, and then write a very simple program instead of
> > > > hacking around with grep etc. But it would be more work of course. And
> > > > maybe not faster, considering that the import would take time.
>
> > > > Here you go: http://dl.dropbox.com/u/17805197/jbo-eng.csv
> > > > It's a tab-separated list; any spreadsheet program should read it.
>
> > > > As a by-product, I am able to produce such a list for any other
> > > > language available in Tatoeba instantly, if anyone's interested.
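For comparison with the bash hack, producing such a tab-separated list for any language pair might look like the sketch below. It is not the original script. The names `direct_pairs` and `write_pairs` and the in-memory data layout are assumptions, and only directly linked sentences are paired.

```python
import csv
import io


def direct_pairs(sentences, links, lang_a="jbo", lang_b="eng"):
    """Return (text_a, text_b) for every direct link from lang_a to lang_b.

    sentences: {id: (lang, text)}
    links:     iterable of (sentence_id, translation_id)
    """
    pairs = []
    for a, b in links:
        if a in sentences and b in sentences:
            if sentences[a][0] == lang_a and sentences[b][0] == lang_b:
                pairs.append((sentences[a][1], sentences[b][1]))
    return pairs


def write_pairs(pairs, out):
    """Write the pairs as tab-separated lines that a spreadsheet can open."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for left, right in pairs:
        writer.writerow([left, right])


# Made-up ids for a tiny example.
sentences = {1: ("jbo", "coi"), 2: ("eng", "Hello"), 3: ("rus", "Привет")}
links = [(1, 2), (1, 3), (2, 1)]
buffer = io.StringIO()
write_pairs(direct_pairs(sentences, links, "jbo", "eng"), buffer)
print(buffer.getvalue(), end="")
```

Changing `lang_b` is all it takes to get the list for another language, which matches the "by-product" mentioned above.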
>
> > > > mu'o mi'e ianek
>
> > > > On 6 Mar, 22:17, ianek <jane...@gmail.com> wrote:
>
> > > >http://tatoeba.org/pol/download_tatoeba_example_sentenceshttp://tatoe...
>
> > > > > There are actually three columns: id, language, sentence, but with
> > > > > some database-fu or script-fu or maybe even spreadsheet-fu you can get
> > > > > what you want. Or maybe I'll hack it together in a while.
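As a starting point for the "script-fu", reading the three columns might look like this. It is a sketch that assumes the file is tab-separated with no header row and no quoting; `read_sentences` is a name chosen for the example.

```python
import csv
import io


def read_sentences(lines):
    """Parse tab-separated id/language/sentence rows into {id: (lang, text)}."""
    # QUOTE_NONE keeps quotation marks inside sentences as ordinary characters.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    sentences = {}
    for row in reader:
        if len(row) != 3:
            continue  # skip rows that do not have exactly three columns
        sentence_id, lang, text = row
        sentences[int(sentence_id)] = (lang, text)
    return sentences


# Two made-up rows standing in for the real file.
sample = io.StringIO('1\tjbo\tcoi\n2\teng\tShe said "Hello".\n')
print(read_sentences(sample))
```

With the real download, the argument would be an open file, for example `open("sentences.csv", encoding="utf-8", newline="")`.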
>
> > > > > mu'o mi'e ianek
>
> > > > > On 6 Mar, 15:19, gleki <gleki.is.my.n...@gmail.com> wrote:
>
> > > > > > I wanna export the Tatoeba database into a simple spreadsheet with two
> > > > > > columns:
> > > > > > one for English and another one for Lojban.
>
> > > > > > Does anyone know how to do that ?
