From: scope845hlang343jbo@icebubble.org
To: lojban@googlegroups.com
Subject: [lojban] Re: Lojban tokenizer for machine learning, first version
Date: Tue, 14 Jun 2022 14:58:57 +0000

Oleg Parashchenko writes:

> I've just released the first version of a lojban tokenizer. It is intended
> for use in machine learning applications and therefore is a bit different
> from a linguistic tokenizer. In particular, it does sub-word tokenization.
>
> Additionally, there is a lexer, which can be used to develop alternative
> tokenizers.

.uanai How is that different from any of the other Lojban parsers that
have been written?

I am interested in your lexer, however. Which version of the grammar did
you use? The PEG? I'd be very curious to see how your lexer distinguishes
between lujvo and fu'ivla.