Received: from mail-vb0-f61.google.com ([209.85.212.61]:38028) by stodi.digitalkingdom.org with esmtps (TLSv1:RC4-SHA:128) (Exim 4.76) (envelope-from ) id 1SAOlv-0005bD-Cz; Wed, 21 Mar 2012 09:52:35 -0700 Received: by vbbfd1 with SMTP id fd1sf783338vbb.16 for ; Wed, 21 Mar 2012 09:52:20 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=googlegroups.com; s=beta; h=x-beenthere:date:from:to:message-id:in-reply-to:references:subject :mime-version:x-original-sender:x-original-authentication-results :reply-to:precedence:mailing-list:list-id:x-google-group-id :list-post:list-help:list-archive:sender:list-subscribe :list-unsubscribe:content-type; bh=5l4tH1O5iytGhFd2+mGsEdbhAdUN6dVtPGEmwCv8xPo=; b=qoiICVwAFidOwdJBHUiDYTvQsWgl9d+NZrOp6fgUNo9CAprEdyujYt23bi5oXkFNem A3q0M24WNKF2SXVhqLypbFdQo3tudsOIkBCHzQ/8B3T1pRGg9eXc4AMi4syRi2hTMqWA +sf/SMUs32h67CpN4eWf6miISgkk6liLvWGZs= Received: by 10.52.71.18 with SMTP id q18mr677145vdu.14.1332348735034; Wed, 21 Mar 2012 09:52:15 -0700 (PDT) X-BeenThere: lojban@googlegroups.com Received: by 10.52.74.67 with SMTP id r3ls416049vdv.9.gmail; Wed, 21 Mar 2012 09:52:13 -0700 (PDT) Received: by 10.52.93.116 with SMTP id ct20mr654743vdb.20.1332348733516; Wed, 21 Mar 2012 09:52:13 -0700 (PDT) Date: Wed, 21 Mar 2012 09:52:12 -0700 (PDT) From: ianek To: lojban@googlegroups.com Message-ID: <783963.269.1332348732955.JavaMail.geo-discussion-forums@vbbp15> In-Reply-To: <877cc974-305f-4763-8756-03768c19d643@s7g2000vby.googlegroups.com> References: <29741151.5374.1331043579316.JavaMail.geo-discussion-forums@vbkc1> <8f2d80fb-7cda-4645-854d-4f119e0d5726@l14g2000vbe.googlegroups.com> <20567224.17.1331117056640.JavaMail.geo-discussion-forums@ynic10> <85d85f4f-d5f5-4fe2-a278-c278b63bffe1@m2g2000vbc.googlegroups.com> <24b50624-5057-46e1-90c1-3b0ba4e4f9e5@gr6g2000vbb.googlegroups.com> <877cc974-305f-4763-8756-03768c19d643@s7g2000vby.googlegroups.com> Subject: [lojban] Re: How to export tatoeba in simple format MIME-Version: 1.0 
X-Original-Sender: janek37@gmail.com
X-Original-Authentication-Results: ls.google.com; spf=pass (google.com: domain of janek37@gmail.com designates internal as permitted sender) smtp.mail=janek37@gmail.com; dkim=pass header.i=@gmail.com
Reply-To: lojban@googlegroups.com
Precedence: list
Mailing-list: list lojban@googlegroups.com; contact lojban+owners@googlegroups.com
List-ID:
X-Google-Group-Id: 1004133512417
List-Post: ,
List-Help: ,
List-Archive:
Sender: lojban@googlegroups.com
List-Subscribe: ,
List-Unsubscribe: ,
Content-Type: multipart/alternative; boundary="----=_Part_268_27020078.1332348732953"
X-Spam-Score: -0.7 (/)
X-Spam_score: -0.7
X-Spam_score_int: -6
X-Spam_bar: /

------=_Part_268_27020078.1332348732953
Content-Type: text/plain; charset=ISO-8859-2
Content-Transfer-Encoding: quoted-printable

OK, I've made it. http://dl.dropbox.com/u/17805197/parse-tatoeba.tar.gz
Unpack it to a directory with links.csv and sentences.csv from Tatoeba.
Run ./prepare-links.sh once. (You'll have to do it again only if you
replace links/sentences with newer files).
Then run ./make-pairs.sh [language-code] > [some filename].csv
For example ./make-pairs.sh eng > jbo-eng.csv

I've made it so that it gathers all of the interlinked sentences. This has
some drawbacks. Do you know the "phone game"? If you do, you know what I'm
saying. If you don't, you will know when you look at some pairs...

mu'o mi'e ianek

On Wednesday, March 7, 2012 7:36:44 PM UTC+1, ianek wrote:
>
> http://dl.dropbox.com/u/17805197/jbo-rus.csv
>
> But it's probably not complete, for the reason I mentioned.
>
> On 7 Mar, 19:32, ianek wrote:
> > I've just found out that links.csv is not complete, i.e. it doesn't
> > cover all the pairs. For example, we have a Lojban sentence "lo purci
> > ka'e te djuno gi'e na ka'e se galfi .i lo balvi ka'e se galfi gi'e na
> > ka'e te djuno" and a Polish sentence "Przeszłość może być tylko
> > poznana, nie zmieniona. Przyszłość może być tylko zmieniona, nie
> > poznana." and they're not linked to each other, but they both are
> > linked to "The past can only be known, not changed. The future can
> > only be changed, not known.". I wonder if there's a rule that such
> > sentences always have a "common relative"; it would certainly make
> > things easier. But I think that now using a database (maybe sqlite3)
> > would be necessary.
> >
> > mu'o mi'e ianek
> >
> > On 7 Mar, 15:51, ianek wrote:
> >
> > > What platform? Is Linux ok?
> >
> > > On 7 Mar, 11:44, gleki wrote:
> >
> > > > I'm interested, and actually in doing it myself periodically, not by
> > > > request, because the database is live and is being updated by us.
> >
> > > > Of course I know about those three files.
> >
> > > > For now, I'd prefer such an export for several directions at once (a
> > > > multilingual spreadsheet).
> > > > I want all sentences for which we have lojban translations,
> > > > i.e.
> > > > first column: lojban
> > > > second column: english
> > > > then I need:
> > > > japanese
> > > > chinese
> > > > russian
> > > > arabic
> > > > spanish
> > > > polish
> > > > french
> > > > german
> >
> > > > I'll repeat once again. An automated script for doing so would be
> > > > awesome.
> >
> > > > On Wednesday, March 7, 2012 2:47:17 AM UTC+4, ianek wrote:
> >
> > > > > I've created the list for you, but it was an ugly hack in bash. A
> > > > > better way would be to create a database and import sentences.csv and
> > > > > links.csv to it, and then write a very simple program instead of
> > > > > hacking around with grep etc. But it would be more work of course. And
> > > > > maybe not faster, considering that import would take time.
> >
> > > > > Here you go: http://dl.dropbox.com/u/17805197/jbo-eng.csv
> > > > > It's a tab-separated list, any spreadsheet program should read it.
> >
> > > > > As a by-product, I am able to produce such a list for any other
> > > > > language available in tatoeba instantly, if anyone's interested.
> >
> > > > > mu'o mi'e ianek
> >
> > > > > On 6 Mar, 22:17, ianek wrote:
> >
> > > > > > http://tatoeba.org/pol/download_tatoeba_example_sentenceshttp://tatoe...
> >
> > > > > > There are actually three columns: id, language, sentence, but with
> > > > > > some database-fu or script-fu or maybe even spreadsheet-fu you can get
> > > > > > what you want. Or maybe I'll hack it together in a while.
> >
> > > > > > mu'o mi'e ianek
> >
> > > > > > On 6 Mar, 15:19, gleki wrote:
> >
> > > > > > > I wanna export tatoeba database into a simple spreadsheet with two
> > > > > > > columns.
> > > > > > > One for English and another one for Lojban
> >
> > > > > > > Does anyone know how to do that ?

--
You received this message because you are subscribed to the Google Groups "lojban" group.
To view this discussion on the web visit https://groups.google.com/d/msg/lojban/-/PLp6H0iMVuIJ.
To post to this group, send email to lojban@googlegroups.com.
To unsubscribe from this group, send email to lojban+unsubscribe@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/lojban?hl=en.

------=_Part_268_27020078.1332348732953
Content-Type: text/html; charset=ISO-8859-2
Content-Transfer-Encoding: quoted-printable

OK, I've made it. http://dl.dropbox.com/u/17805197/parse-tatoeba.tar.gz
Unpack it to a directory with links.csv and sentences.csv from Tatoeba.
Run ./prepare-links.sh once. (You'll have to do it again only if you replace links/sentences with newer files).
Then run ./make-pairs.sh [language-code] > [some filename].csv
For example ./make-pairs.sh eng > jbo-eng.csv

I've made it so that it gathers all of the interlinked sentences. This has some drawbacks. Do you know the "phone game"? If you do, you know what I'm saying. If you don't, you will know when you look at some pairs...
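
To illustrate what gathering all interlinked sentences means, here is a minimal Python sketch. It is not the code of the shell scripts in parse-tatoeba.tar.gz; the function name linked_group and the sentence ids are made up for the example, and links are assumed to be pairs of sentence ids as in links.csv. It follows every chain of links from a starting sentence:

```python
from collections import defaultdict, deque


def linked_group(links, start):
    """Return every sentence id reachable from `start` through any chain of links."""
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)  # assumption: a link works in both directions

    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for other in neighbours[current]:
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return seen


# Hypothetical chain: 1 (jbo) - 2 (eng) - 3 (pol) - 4 (eng)
links = [(1, 2), (2, 3), (3, 4)]
group = linked_group(links, 1)
```

In this toy chain, sentence 4 lands in the same group as sentence 1 although it is three links away, which is where the "phone game" effect comes from.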

mu'o mi'e ianek

On Wednesday, March 7, 2012 7:36:44 PM UTC+1, ianek wrote:
http://dl.dropbox.com/u/17805197/jbo-rus.csv

But it's probably not complete, for the reason I mentioned.

On 7 Mar, 19:32, ianek <jane...@gmail.com> wrote:
> I've just found out that links.csv is not complete, i.e. it doesn't
> cover all the pairs. For example, we have a Lojban sentence "lo purci
> ka'e te djuno gi'e na ka'e se galfi .i lo balvi ka'e se galfi gi'e na
> ka'e te djuno" and a Polish sentence "Przeszłość może być tylko
> poznana, nie zmieniona. Przyszłość może być tylko zmieniona, nie
> poznana." and they're not linked to each other, but they both are
> linked to "The past can only be known, not changed. The future can
> only be changed, not known.". I wonder if there's a rule that such
> sentences always have a "common relative"; it would certainly make
> things easier. But I think that now using a database (maybe sqlite3)
> would be necessary.
>
> mu'o mi'e ianek
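
The "common relative" idea from the quoted message above can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: the table and column names (sentences, links, sentence_id, translation_id), the function name and the ids are hypothetical, the example sentences are shortened, and each link is assumed to be stored in both directions. The query pairs two sentences that both link to a third one:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sentences (id INTEGER PRIMARY KEY, lang TEXT, text TEXT);
    CREATE TABLE links (sentence_id INTEGER, translation_id INTEGER);
""")
# Toy data with made-up ids: jbo and pol are each linked only to eng.
conn.executemany("INSERT INTO sentences VALUES (?, ?, ?)", [
    (1, "jbo", "lo purci ka'e te djuno gi'e na ka'e se galfi"),
    (2, "eng", "The past can only be known, not changed."),
    (3, "pol", "Przeszłość może być tylko poznana, nie zmieniona."),
])
conn.executemany("INSERT INTO links VALUES (?, ?)",
                 [(1, 2), (2, 1), (2, 3), (3, 2)])


def pairs_via_common_relative(conn, lang_a, lang_b):
    """Pair sentences of two languages that share a linked third sentence."""
    query = """
        SELECT DISTINCT a.text, b.text
        FROM sentences AS a
        JOIN links AS la ON la.sentence_id = a.id
        JOIN links AS lb ON lb.sentence_id = la.translation_id
        JOIN sentences AS b ON b.id = lb.translation_id
        WHERE a.lang = ? AND b.lang = ? AND a.id <> b.id
    """
    return conn.execute(query, (lang_a, lang_b)).fetchall()


pairs = pairs_via_common_relative(conn, "jbo", "pol")
```

The Lojban and Polish sentences come out as a pair even though no row in links connects them directly. Whether every unlinked pair really has such a common relative is the open question raised in the message.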
>
> On 7 Mar, 15:51, ianek <jane...@gmail.com> wrote:
>
> > What platform? Is Linux ok?
>
> > On 7 Mar, 11:44, gleki <gleki.is.my.n...@gmail.com> wrote:
>
> > > I'm interested, and actually in doing it myself periodically, not by
> > > request, because the database is live and is being updated by us.
>
> > > Of course I know about those three files.
>
> > > For now, I'd prefer such an export for several directions at once (a
> > > multilingual spreadsheet).
> > > I want all sentences for which we have lojban translations,
> > > i.e.
> > > first column: lojban
> > > second column: english
> > > then I need:
> > > japanese
> > > chinese
> > > russian
> > > arabic
> > > spanish
> > > polish
> > > french
> > > german
>
> > > I'll repeat once again. An automated script for doing so would be awesome.
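
A multilingual spreadsheet like the one requested in the quoted message above could be built roughly as follows. This is a sketch, not an existing tool: the function names are hypothetical, only direct links are used, and apart from jbo, eng and rus the three-letter language codes in LANGS are my assumption about how Tatoeba labels those languages.

```python
import csv
import io
from collections import defaultdict

# Assumed language codes for the requested columns (after the Lojban column).
LANGS = ["eng", "jpn", "cmn", "rus", "ara", "spa", "pol", "fra", "deu"]


def multilingual_rows(sentences, links, source_lang="jbo", langs=LANGS):
    """Build one row per source sentence: its text, then one cell per language.

    sentences: {id: (lang, text)}; links: iterable of (id, translation_id).
    """
    direct = defaultdict(set)
    for a, b in links:
        direct[a].add(b)

    rows = []
    for sid, (lang, text) in sorted(sentences.items()):
        if lang != source_lang:
            continue
        row = [text]
        for target in langs:
            found = sorted(sentences[t][1] for t in direct[sid]
                           if t in sentences and sentences[t][0] == target)
            # Several translations in one language share a cell.
            row.append(" | ".join(found))
        rows.append(row)
    return rows


def to_tsv(rows, header):
    """Serialise rows as tab-separated text that a spreadsheet can open."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()


# Toy data with made-up ids.
sentences = {1: ("jbo", "coi"), 2: ("eng", "Hello."),
             3: ("rus", "Привет."), 4: ("eng", "Hi.")}
links = [(1, 2), (2, 1), (1, 3), (3, 1), (1, 4), (4, 1)]
rows = multilingual_rows(sentences, links, langs=["eng", "rus", "pol"])
table = to_tsv(rows, ["jbo", "eng", "rus", "pol"])
```

Languages without a translation simply get an empty cell, so the column layout stays fixed no matter which translations exist.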
>
> > > On Wednesday, March 7, 2012 2:47:17 AM UTC+4, ianek wrote:
>
> > > > I've created the list for you, but it was an ugly hack in bash. A
> > > > better way would be to create a database and import sentences.csv and
> > > > links.csv to it, and then write a very simple program instead of
> > > > hacking around with grep etc. But it would be more work of course. And
> > > > maybe not faster, considering that import would take time.
>
> > > > Here you go: http://dl.dropbox.com/u/17805197/jbo-eng.csv
> > > > It's a tab-separated list, any spreadsheet program should read it.
>
> > > > As a by-product, I am able to produce such a list for any other
> > > > language available in tatoeba instantly, if anyone's interested.
>
> > > > mu'o mi'e ianek
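
The database approach from the quoted message above (import sentences.csv and links.csv, then query them) could look roughly like this with Python's built-in sqlite3 and csv modules. It is a sketch under assumptions: the table layout and function names are hypothetical, sentences.csv is taken to have the three tab-separated columns id, language, sentence, and links.csv is assumed to hold two tab-separated ids per line.

```python
import csv
import io
import sqlite3


def import_tatoeba(conn, sentences_file, links_file):
    """Load the two tab-separated exports into two SQLite tables."""
    conn.executescript("""
        CREATE TABLE sentences (id INTEGER PRIMARY KEY, lang TEXT, text TEXT);
        CREATE TABLE links (sentence_id INTEGER, translation_id INTEGER);
        CREATE INDEX links_by_sentence ON links (sentence_id);
    """)
    # QUOTE_NONE: quotation marks inside sentences are ordinary characters.
    sentence_rows = csv.reader(sentences_file, delimiter="\t",
                               quoting=csv.QUOTE_NONE)
    conn.executemany(
        "INSERT INTO sentences VALUES (?, ?, ?)",
        ((int(r[0]), r[1], r[2]) for r in sentence_rows if len(r) >= 3))
    link_rows = csv.reader(links_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    conn.executemany(
        "INSERT INTO links VALUES (?, ?)",
        ((int(r[0]), int(r[1])) for r in link_rows if len(r) >= 2))
    conn.commit()


def direct_pairs(conn, lang_a, lang_b):
    """Return (text_a, text_b) for every directly linked pair of sentences."""
    query = """
        SELECT a.text, b.text
        FROM sentences AS a
        JOIN links AS l ON l.sentence_id = a.id
        JOIN sentences AS b ON b.id = l.translation_id
        WHERE a.lang = ? AND b.lang = ?
        ORDER BY a.id, b.id
    """
    return conn.execute(query, (lang_a, lang_b)).fetchall()


# In-memory stand-ins for the real files; the ids are made up.
sentences_tsv = io.StringIO("1\tjbo\tcoi\n2\teng\tHello.\n")
links_tsv = io.StringIO("1\t2\n2\t1\n")
conn = sqlite3.connect(":memory:")
import_tatoeba(conn, sentences_tsv, links_tsv)
jbo_eng = direct_pairs(conn, "jbo", "eng")
```

With the real exports, the two StringIO objects would be replaced by the opened files, and the import would only need repeating when newer files are downloaded.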
>
> > > > On 6 Mar, 22:17, ianek <jane...@gmail.com> wrote:
>
> > > >http://tatoeba.org/pol/download_tatoeba_example_sentenceshttp://tatoe...
>
> > > > > There are actually three columns: id, language, sentence, but with
> > > > > some database-fu or script-fu or maybe even spreadsheet-fu you can get
> > > > > what you want. Or maybe I'll hack it together in a while.
>
> > > > > mu'o mi'e ianek
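
As a small illustration of the "script-fu" option in the quoted message above, the three columns (id, language, sentence) can be read and filtered by language without any database. The function name sentences_in and the sample ids are made up, and the file is assumed to be tab-separated:

```python
import csv
import io


def sentences_in(lines, lang):
    """Map sentence id to text for every row whose language column equals `lang`."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {int(row[0]): row[2]
            for row in reader
            if len(row) >= 3 and row[1] == lang}


# Stand-in for an opened sentences.csv; the ids are made up.
sample = io.StringIO("1\tjbo\tcoi\n2\teng\tHello.\n3\tjbo\tco'o\n")
lojban = sentences_in(sample, "jbo")
```

Combining two such dictionaries with the id pairs from links.csv is then enough to produce a two-column list.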
>
> > > > > On 6 Mar, 15:19, gleki <gleki.is.my.n...@gmail.com> wrote:
>
> > > > > > I wanna export tatoeba database into a simple spreadsheet with two
> > > > > > columns.
> > > > > > One for English and another one for Lojban
>
> > > > > > Does anyone know how to do that ?

------=_Part_268_27020078.1332348732953--