Date: Mon, 20 May 2013 07:27:00 -0700 (PDT)
From: la gleki <gleki.is.my.name@gmail.com>
To: lojban@googlegroups.com
Message-Id: <8ade752b-873d-4bcc-a940-7829434bc92b@googlegroups.com>
In-Reply-To: <783963.269.1332348732955.JavaMail.geo-discussion-forums@vbbp15>
References: <29741151.5374.1331043579316.JavaMail.geo-discussion-forums@vbkc1>
 <d9078ebe-3488-4fe0-adbd-d9f60abea5a7@q11g2000vbu.googlegroups.com>
 <8f2d80fb-7cda-4645-854d-4f119e0d5726@l14g2000vbe.googlegroups.com>
 <20567224.17.1331117056640.JavaMail.geo-discussion-forums@ynic10>
 <85d85f4f-d5f5-4fe2-a278-c278b63bffe1@m2g2000vbc.googlegroups.com> <24b50624-5057-46e1-90c1-3b0ba4e4f9e5@gr6g2000vbb.googlegroups.com>
 <877cc974-305f-4763-8756-03768c19d643@s7g2000vby.googlegroups.com>
 <783963.269.1332348732955.JavaMail.geo-discussion-forums@vbbp15>
Subject: [lojban] Re: How to export tatoeba in simple format
MIME-Version: 1.0
Reply-To: lojban@googlegroups.com
Precedence: list
Mailing-list: list lojban@googlegroups.com; contact lojban+owners@googlegroups.com
Sender: lojban@googlegroups.com
Content-Type: multipart/alternative; 
	boundary="----=_Part_378_11126777.1369060020379"
X-Spam_score: -0.1
X-Spam_score_int: 0
X-Spam_bar: /

------=_Part_378_11126777.1369060020379
Content-Type: text/plain; charset=ISO-8859-2
Content-Transfer-Encoding: quoted-printable


On Wednesday, March 21, 2012 8:52:12 PM UTC+4, ianek wrote:
>
> OK, I've made it. http://dl.dropbox.com/u/17805197/parse-tatoeba.tar.gz
> Unpack it to a directory with links.csv and sentences.csv from Tatoeba.
> Run ./prepare-links.sh once. (You'll have to do it again only if you
> replace links/sentences with newer files).
> Then run ./make-pairs.sh [language-code] > [some filename].csv
> For example ./make-pairs.sh eng > jbo-eng.csv
>
> I've made it so that it gathers all of the interlinked sentences. This has
> some drawbacks. Do you know the "phone game"? If you do, you know what I'm
> saying. If you don't, you will know when you look at some pairs...
>


This is a great script. But can we have another one that uses only direct
translations, to get rid of that broken "phone game" effect?
Also, can we have a script that links indirect translations only up to a
given depth (e.g. 1 level)?
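
To make the request concrete, here is a rough Python sketch of the behaviour being asked for. It is not ianek's script and says nothing about how make-pairs.sh works internally. It assumes sentences.csv is tab-separated with the three columns id, language, sentence, and that links.csv is tab-separated with two sentence ids per line. The function names and the max_hops parameter are hypothetical, chosen only for illustration. With max_hops=1 only direct translations are returned; max_hops=2 also allows one intermediate sentence.

```python
import csv
from collections import defaultdict, deque


def load_sentences(path):
    """Return {sentence_id: (lang, text)} from a tab-separated file (assumed layout)."""
    sentences = {}
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) >= 3:
                sentences[row[0]] = (row[1], row[2])
    return sentences


def load_links(path):
    """Return an undirected adjacency map {sentence_id: set of linked ids}."""
    links = defaultdict(set)
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) >= 2:
                links[row[0]].add(row[1])
                links[row[1]].add(row[0])
    return links


def make_pairs(sentences, links, source_lang, target_lang, max_hops=1):
    """Yield (source_text, target_text, hops) for every target-language
    sentence reachable from a source-language sentence in at most
    max_hops links (breadth-first, so hops is the shortest distance)."""
    for start, (lang, text) in sentences.items():
        if lang != source_lang:
            continue
        seen = {start}
        queue = deque([(start, 0)])
        while queue:
            node, hops = queue.popleft()
            if hops == max_hops:
                continue  # do not follow links beyond the requested depth
            for neighbour in links.get(node, ()):
                if neighbour in seen:
                    continue
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
                if neighbour in sentences and sentences[neighbour][0] == target_lang:
                    yield text, sentences[neighbour][1], hops + 1
```

With both files loaded via load_sentences and load_links, calling make_pairs(sentences, links, "jbo", "eng", max_hops=1) would give only directly linked Lojban/English pairs. Each result carries its hop count, so more distant pairs can be filtered out or flagged afterwards.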

>
> mu'o mi'e ianek
>
> On Wednesday, March 7, 2012 7:36:44 PM UTC+1, ianek wrote:
>>
>> http://dl.dropbox.com/u/17805197/jbo-rus.csv
>>
>> But it's probably not complete, for the reason I mentioned.
>>
>> On 7 Mar, 19:32, ianek <jane...@gmail.com> wrote:
>> > I've just found out that links.csv is not complete, i.e. it doesn't
>> > cover all the pairs. For example, we have a Lojban sentence "lo purci
>> > ka'e te djuno gi'e na ka'e se galfi .i lo balvi ka'e se galfi gi'e na
>> > ka'e te djuno" and a Polish sentence "Przeszłość może być tylko
>> > poznana, nie zmieniona. Przyszłość może być tylko zmieniona, nie
>> > poznana." and they're not linked to each other, but they both are
>> > linked to "The past can only be known, not changed. The future can
>> > only be changed, not known.". I wonder if there's a rule that such
>> > sentences always have a "common relative"; it would certainly make
>> > things easier. But I think that now using a database (maybe sqlite3)
>> > would be necessary.
>> >
>> > mu'o mi'e ianek
>> >
>> > On 7 Mar, 15:51, ianek <jane...@gmail.com> wrote:
>> >
>> > > What platform? Is Linux ok?
>> >
>> > > On 7 Mar, 11:44, gleki <gleki.is.my.n...@gmail.com> wrote:
>> >
>> > > > I'm interested, and actually in doing it myself periodically, not
>> > > > by request, because the database is live and is being updated by us.
>> >
>> > > > Of course I know about those three files.
>> >
>> > > > For now, I'd prefer such an export for several directions at once (a
>> > > > multilingual spreadsheet).
>> > > > I want all sentences for which we have Lojban translations,
>> > > > i.e.
>> > > > first column: Lojban
>> > > > second column: English
>> > > > then I need
>> > > > Japanese
>> > > > Chinese
>> > > > Russian
>> > > > Arabic
>> > > > Spanish
>> > > > Polish
>> > > > French
>> > > > German
>> >
>> > > > I'll repeat once again: an automated script for doing so would be
>> > > > awesome.
>> >
>> > > > On Wednesday, March 7, 2012 2:47:17 AM UTC+4, ianek wrote:
>> >
>> > > > > I've created the list for you, but it was an ugly hack in bash. A
>> > > > > better way would be to create a database and import sentences.csv and
>> > > > > links.csv into it, and then write a very simple program instead of
>> > > > > hacking around with grep etc. But it would be more work, of course. And
>> > > > > maybe not faster, considering that the import would take time.
>> >
>> > > > > Here you go: http://dl.dropbox.com/u/17805197/jbo-eng.csv
>> > > > > It's a tab-separated list; any spreadsheet program should read it.
>> >
>> > > > > As a by-product, I am able to produce such a list for any other
>> > > > > language available in Tatoeba instantly, if anyone's interested.
>> >
>> > > > > mu'o mi'e ianek
>> >
>> > > > > On 6 Mar, 22:17, ianek <jane...@gmail.com> wrote:
>> >
>> > > > > http://tatoeba.org/pol/download_tatoeba_example_sentences http://tatoe...
>> >
>> > > > > > There are actually three columns: id, language, sentence, but with
>> > > > > > some database-fu or script-fu or maybe even spreadsheet-fu you can get
>> > > > > > what you want. Or maybe I'll hack it together in a while.
>> >
>> > > > > > mu'o mi'e ianek
>> >
>> > > > > > On 6 Mar, 15:19, gleki <gleki.is.my.n...@gmail.com> wrote:
>> >
>> > > > > > > I want to export the Tatoeba database into a simple spreadsheet
>> > > > > > > with two columns,
>> > > > > > > one for English and another one for Lojban.
>> >
>> > > > > > > Does anyone know how to do that?
>
>



------=_Part_378_11126777.1369060020379--