From: la gleki
To: lojban@googlegroups.com
Date: Mon, 20 May 2013 07:27:00 -0700 (PDT)
Subject: [lojban] Re: How to export tatoeba in simple format
Message-Id: <8ade752b-873d-4bcc-a940-7829434bc92b@googlegroups.com>
In-Reply-To: <783963.269.1332348732955.JavaMail.geo-discussion-forums@vbbp15>
X-Original-Sender: gleki.is.my.name@gmail.com

On Wednesday, March 21, 2012 8:52:12 PM UTC+4, ianek wrote:
> OK, I've made it. http://dl.dropbox.com/u/17805197/parse-tatoeba.tar.gz
> Unpack it to a directory with links.csv and sentences.csv from Tatoeba.
> Run ./prepare-links.sh once. (You'll have to do it again only if you
> replace links/sentences with newer files.)
> Then run ./make-pairs.sh [language-code] > [some filename].csv
> For example: ./make-pairs.sh eng > jbo-eng.csv
>
> I've made it so that it gathers all of the interlinked sentences. This has
> some drawbacks. Do you know the "phone game"? If you do, you know what I'm
> saying. If you don't, you will know when you look at some pairs...

This is a great script. But can we have another one with only direct translations, to remove that broken phone game effect?
Also, can we have a script that links indirect translations only to a given depth (e.g. 1 level deep)?
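
To make the request concrete, here is a minimal sketch of the depth limit I have in mind. It is not ianek's script: the function name `translations_within` and the sample ids are made up for illustration, and I'm assuming links.csv boils down to pairs of sentence ids. With max_hops=1 only direct translations come back; I read "1 level deep" as allowing one intermediate sentence, i.e. max_hops=2.

```python
from collections import defaultdict, deque

def translations_within(links, start, max_hops=1):
    """Return {sentence id: hop count} for ids reachable from `start`
    in at most `max_hops` link steps (breadth-first search)."""
    # Treat links as undirected so a one-way listing still connects both ends.
    neighbours = defaultdict(set)
    for src, dst in links:
        neighbours[src].add(dst)
        neighbours[dst].add(src)

    hops = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if hops[current] == max_hops:
            continue  # do not expand past the depth limit
        for nxt in neighbours[current]:
            if nxt not in hops:
                hops[nxt] = hops[current] + 1
                queue.append(nxt)

    del hops[start]  # the start sentence is not its own translation
    return hops

# Hypothetical chain of links: 1 <-> 2 <-> 3 <-> 4
sample_links = [(1, 2), (2, 3), (3, 4)]
direct_only = translations_within(sample_links, 1, max_hops=1)
one_level_deep = translations_within(sample_links, 1, max_hops=2)
```

The hop count is returned along with each id, so the output could mark which pairs are direct and which went through an intermediate sentence.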
> mu'o mi'e ianek
>
> On Wednesday, March 7, 2012 7:36:44 PM UTC+1, ianek wrote:
>> http://dl.dropbox.com/u/17805197/jbo-rus.csv
>>
>> But it's probably not complete, for the reason I mentioned.
>>
>> On 7 Mar, 19:32, ianek <jane...@gmail.com> wrote:
>> > I've just found out that links.csv is not complete, i.e. it doesn't
>> > cover all the pairs. For example, we have a Lojban sentence "lo purci
>> > ka'e te djuno gi'e na ka'e se galfi .i lo balvi ka'e se galfi gi'e na
>> > ka'e te djuno" and a Polish sentence "Przeszłość może być tylko
>> > poznana, nie zmieniona. Przyszłość może być tylko zmieniona, nie
>> > poznana." and they're not linked to each other, but they both are
>> > linked to "The past can only be known, not changed. The future can
>> > only be changed, not known.". I wonder if there's a rule that such
>> > sentences always have a "common relative"; it would certainly make
>> > things easier. But I think that now using a database (maybe sqlite3)
>> > would be necessary.
>> >
>> > mu'o mi'e ianek
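
A rough sketch of the sqlite3 idea, using the example above. I'm assuming links.csv is a list of (sentence id, translation id) pairs and that sentences.csv has the id, language, sentence columns; the table names, the function names and the ids 1 to 3 are hypothetical. The query returns pairs connected through one intermediate sentence, which is the "common relative" case.

```python
import sqlite3

def build_db(sentences, links):
    """Load (id, lang, text) rows and (src, dst) link rows into an in-memory database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sentences (id INTEGER PRIMARY KEY, lang TEXT, text TEXT)")
    db.execute("CREATE TABLE links (src INTEGER, dst INTEGER)")
    db.executemany("INSERT INTO sentences VALUES (?, ?, ?)", sentences)
    db.executemany("INSERT INTO links VALUES (?, ?)", links)
    db.execute("CREATE INDEX links_src ON links (src)")
    return db

# a -> (common relative) -> b: two link steps joined together.
VIA_RELATIVE = """
SELECT DISTINCT a.id, a.text, b.id, b.text
FROM sentences AS a
JOIN links AS l1 ON l1.src = a.id
JOIN links AS l2 ON l2.src = l1.dst
JOIN sentences AS b ON b.id = l2.dst
WHERE a.lang = ? AND b.lang = ?
ORDER BY a.id, b.id
"""

def pairs_via_relative(db, lang_a, lang_b):
    """Return (id_a, text_a, id_b, text_b) rows linked through one intermediate sentence."""
    return db.execute(VIA_RELATIVE, (lang_a, lang_b)).fetchall()

# The three sentences from the example, with made-up ids.
sample_sentences = [
    (1, "jbo", "lo purci ka'e te djuno gi'e na ka'e se galfi"),
    (2, "eng", "The past can only be known, not changed."),
    (3, "pol", "Przeszłość może być tylko poznana, nie zmieniona."),
]
# Lojban and Polish are each linked to English, but not to each other.
sample_links = [(1, 2), (2, 1), (2, 3), (3, 2)]
db = build_db(sample_sentences, sample_links)
found = pairs_via_relative(db, "jbo", "pol")
```

For the real files the in-memory database would presumably be replaced by a file on disk, so the import only has to be repeated when the Tatoeba exports change.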
>> >
>> > On 7 Mar, 15:51, ianek <jane...@gmail.com> wrote:
>> >
>> > > What platform? Is Linux ok?
>> >
>> > > On 7 Mar, 11:44, gleki <gleki.is.my.n...@gmail.com> wrote:
>> >
>> > > > I'm interested. And actually in periodically doing it myself, not by
>> > > > request,
>> > > > because the database is live and is being updated by us.
>> >
>> > > > Of course I know about those three files.
>> >
>> > > > For now, I'd prefer such an export for several directions at once (a
>> > > > multilingual spreadsheet).
>> > > > I want all sentences for which we have Lojban translations,
>> > > > i.e.
>> > > > first column: Lojban
>> > > > second column: English
>> > > > then I need
>> > > > Japanese
>> > > > Chinese
>> > > > Russian
>> > > > Arabic
>> > > > Spanish
>> > > > Polish
>> > > > French
>> > > > German
>> >
>> > > > I'll repeat once again: an automated script for doing so would be awesome.
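
For the multilingual sheet described above, something along these lines might do. It is only a sketch: `multilingual_rows` is a made-up name, the input shapes (a dict of id -> (language, text) and a list of id pairs) are assumptions, and apart from jbo, eng and rus the three-letter language codes are my guess at what Tatoeba uses. Only direct links are followed, and several translations into the same language are joined with " | ".

```python
from collections import defaultdict

# Column order after the Lojban column; codes other than eng/rus are assumed.
DEFAULT_LANGS = ("eng", "jpn", "cmn", "rus", "ara", "spa", "pol", "fra", "deu")

def multilingual_rows(sentences, links, pivot="jbo", langs=DEFAULT_LANGS):
    """Build one row per pivot-language sentence: [pivot text, lang1 cell, lang2 cell, ...]."""
    # by_pivot[pivot sentence id][language code] -> list of translation texts
    by_pivot = defaultdict(lambda: defaultdict(list))
    for src, dst in links:
        a = sentences.get(src)
        b = sentences.get(dst)
        if a is None or b is None:
            continue  # link refers to a sentence that was not loaded
        if a[0] == pivot and b[0] in langs:
            by_pivot[src][b[0]].append(b[1])

    rows = []
    for sent_id in sorted(by_pivot):
        cells = [sentences[sent_id][1]]
        for lang in langs:
            # Empty cell when there is no direct translation in this language.
            cells.append(" | ".join(by_pivot[sent_id][lang]))
        rows.append(cells)
    return rows

# Made-up sample data.
sample_sentences = {
    1: ("jbo", "coi"),
    2: ("eng", "Hello"),
    3: ("pol", "Cześć"),
    4: ("eng", "Hi"),
    5: ("jbo", "co'o"),
}
sample_links = [(1, 2), (2, 1), (1, 3), (1, 4)]
table = multilingual_rows(sample_sentences, sample_links, langs=("eng", "pol"))
```

Each row can then be written out with "\t".join(row) to get a tab-separated file that a spreadsheet program opens directly.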
>> >
>> > > > On Wednesday, March 7, 2012 2:47:17 AM UTC+4, ianek wrote:
>> >
>> > > > > I've created the list for you, but it was an ugly hack in bash. A
>> > > > > better way would be to create a database and import sentences.csv and
>> > > > > links.csv into it, and then write a very simple program instead of
>> > > > > hacking around with grep etc. But it would be more work of course. And
>> > > > > maybe not faster, considering that the import would take time.
>> >
>> > > > > Here you go: http://dl.dropbox.com/u/17805197/jbo-eng.csv
>> > > > > It's a tab-separated list; any spreadsheet program should read it.
>> >
>> > > > > As a by-product, I am able to produce such a list for any other
>> > > > > language available in Tatoeba instantly, if anyone's interested.
>> >
>> > > > > mu'o mi'e ianek
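
As an alternative to the grep approach, a "very simple program" for the two-column list could look roughly like this. It is a sketch under assumptions, not the bash hack itself: `write_direct_pairs` is a made-up name, sentences are assumed to be already loaded into a dict of id -> (language, text), and links are assumed to be (sentence id, translation id) pairs. It only uses direct links and writes one tab-separated line per pair.

```python
import io

def write_direct_pairs(sentences, links, out, lang_a="jbo", lang_b="eng"):
    """Write one 'lang_a text<TAB>lang_b text' line per direct link; return the count.

    sentences: dict mapping sentence id -> (language code, text)
    links: iterable of (sentence id, translation id) pairs (assumed layout)
    out: any writable text file object
    """
    count = 0
    for src, dst in links:
        a = sentences.get(src)
        b = sentences.get(dst)
        if a is None or b is None:
            continue  # link points at a sentence we did not load
        if a[0] == lang_a and b[0] == lang_b:
            out.write(f"{a[1]}\t{b[1]}\n")
            count += 1
    return count

# Made-up sample data, just to show the call.
sample_sentences = {1: ("jbo", "coi"), 2: ("eng", "Hello"), 3: ("pol", "Cześć")}
sample_links = [(1, 2), (2, 1), (1, 3)]
buffer = io.StringIO()
written = write_direct_pairs(sample_sentences, sample_links, buffer)
```

Passing a file opened for writing instead of the StringIO buffer would produce something like jbo-eng.csv; changing lang_b gives the list for another language.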
>> >
>> > > > > On 6 Mar, 22:17, ianek <jane...@gmail.com> wrote:
>> >
>> > > > > > http://tatoeba.org/pol/download_tatoeba_example_sentences http://tatoe...
>> >
>> > > > > > There are actually three columns: id, language, sentence, but with
>> > > > > > some database-fu or script-fu or maybe even spreadsheet-fu you can get
>> > > > > > what you want. Or maybe I'll hack it together in a while.
>> >
>> > > > > > mu'o mi'e ianek
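
The script-fu for the three columns can be quite small. A minimal sketch, assuming the file is tab-separated with no quoting (the tab separation is mentioned elsewhere in the thread for the output list; for sentences.csv it is my assumption) and using a made-up function name `load_sentences`:

```python
import csv

def load_sentences(lines):
    """Map sentence id -> (language code, text) from tab-separated rows.

    `lines` can be an open text file or any iterable of strings.
    """
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    sentences = {}
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or malformed rows
        sent_id, lang, text = row[0], row[1], row[2]
        sentences[int(sent_id)] = (lang, text)
    return sentences

# Made-up sample rows in the id, language, sentence layout.
sample_rows = ["1\tjbo\tcoi\n", '2\teng\tHe said "hello".\n', "\n"]
loaded = load_sentences(sample_rows)
```

With the real file this would be called as load_sentences(open("sentences.csv", encoding="utf-8", newline="")), and the resulting dict can be filtered by language code.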
>> >
>> > > > > > On 6 Mar, 15:19, gleki <gleki.is.my.n...@gmail.com> wrote:
>> >
>> > > > > > > I wanna export the Tatoeba database into a simple spreadsheet with
>> > > > > > > two columns,
>> > > > > > > one for English and another one for Lojban.
>> >
>> > > > > > > Does anyone know how to do that?
