Received: from mail-pb0-f62.google.com ([209.85.160.62]:42075) by stodi.digitalkingdom.org with esmtps (TLSv1:RC4-SHA:128) (Exim 4.76) (envelope-from ) id 1UKd3U-0007Pe-UJ; Tue, 26 Mar 2013 16:13:37 -0700 Received: by mail-pb0-f62.google.com with SMTP id jt11sf389499pbb.17 for ; Tue, 26 Mar 2013 16:13:18 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=googlegroups.com; s=20120806; h=x-received:x-beenthere:x-received:received-spf:date:from:to:subject :message-id:mail-followup-to:references:mime-version:in-reply-to :user-agent:x-original-sender:x-original-authentication-results :reply-to:precedence:mailing-list:list-id:x-google-group-id :list-post:list-help:list-archive:sender:list-subscribe :list-unsubscribe:content-type:content-disposition; bh=8CMoPUXueFIjDBsexFlO1elbI0Hgog7DBPwl2+Vjc9Y=; b=j/MQ2kWIqcIS2cS+X4iAsWE6WdmGZch/xPkEDABHoiW9jpGD9w2YTW6IogdOK1YvCc jMK5R5VxHYtuXHeZUPIB/MIVOjG2OkovNO3Fy80yGFLk1rutATDULACifBaxjJh10aSi xdNaToruw+cP7APp5l4KFjYIrbVwGQX7SQVRgod2K0ktEw8kAvFIvJjV4fxisDQaHynu TrzbaBEh7EY6LwZWEnAY+oL2U9lSlSVFCspdOrTuP/EeM6pMhgWGtUc48nhZpi0iFDC9 6U/COJtEzVgeRarZd2uiKU55qDPxItYwZmtiFZ/IGepaxnuk7jDsEG+RXiNoBmnetDBE n+nw== X-Received: by 10.50.217.225 with SMTP id pb1mr884030igc.5.1364339598444; Tue, 26 Mar 2013 16:13:18 -0700 (PDT) X-BeenThere: lojban@googlegroups.com Received: by 10.50.95.198 with SMTP id dm6ls573926igb.16.canary; Tue, 26 Mar 2013 16:13:17 -0700 (PDT) X-Received: by 10.66.119.43 with SMTP id kr11mr2314882pab.4.1364339597382; Tue, 26 Mar 2013 16:13:17 -0700 (PDT) Received: from stodi.digitalkingdom.org (mail.digitalkingdom.org. [173.13.139.236]) by gmr-mx.google.com with ESMTPS id xb6si510148pab.0.2013.03.26.16.13.17 (version=TLSv1 cipher=RC4-SHA bits=128/128); Tue, 26 Mar 2013 16:13:17 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of rlpowell@digitalkingdom.org designates 173.13.139.236 as permitted sender) client-ip=173.13.139.236; Received: from rlpowell by stodi.digitalkingdom.org with local (Exim 4.76) (envelope-from ) id 1UKd3M-0007Pb-7I for lojban@googlegroups.com; Tue, 26 Mar 2013 16:13:16 -0700 Date: Tue, 26 Mar 2013 16:13:16 -0700 From: Robin Lee Powell To: lojban@googlegroups.com Subject: Re: [lojban] Need some jbovlaste programming help. Message-ID: <20130326231316.GH22685@stodi.digitalkingdom.org> Mail-Followup-To: lojban@googlegroups.com References: <20130325180820.GU6328@stodi.digitalkingdom.org> <3520631.PfPziTd2d6@caracal> MIME-Version: 1.0 In-Reply-To: <3520631.PfPziTd2d6@caracal> User-Agent: Mutt/1.5.21 (2010-09-15) X-Original-Sender: rlpowell@digitalkingdom.org X-Original-Authentication-Results: gmr-mx.google.com; spf=pass (google.com: best guess record for domain of rlpowell@digitalkingdom.org designates 173.13.139.236 as permitted sender) smtp.mail=rlpowell@digitalkingdom.org Reply-To: lojban@googlegroups.com Precedence: list Mailing-list: list lojban@googlegroups.com; contact lojban+owners@googlegroups.com List-ID: X-Google-Group-Id: 1004133512417 List-Post: , List-Help: , List-Archive: Sender: lojban@googlegroups.com List-Subscribe: , List-Unsubscribe: , Content-Type: text/plain; charset=ISO-8859-1 Content-Disposition: inline X-Spam-Score: 0.0 (/) X-Spam_score: 0.0 X-Spam_score_int: 0 X-Spam_bar: / On Tue, Mar 26, 2013 at 09:15:47AM -0400, Pierre Abbat wrote: > On Monday, March 25, 2013 11:08:21 Robin Lee Powell wrote: > > Some bad data has snuck its way in to jbovlaste (a good chunk from > > an import script I screwed up that we can't just re-run, but some of > > it isn't from that, so not sure what's going on) and it needs > > cleaning. > > > > I have neither the time nor inclination. > > > > I don't much care what it's written in as long as it's UTF-8 safe > > (i.e. bash isn't going to cut it), but we need something that does > > the following: > > > > For every natlang word: > > > > if a duplicate (same word, meaning, and langid) exists, > > consolidate them. This means deleting the duplicate, combining > > the "notes" field for the two of them, and updating all instances > > of the id you just deleted to point to the one that still exists > > in the tables threads, keywordmapping, natlangwordbestguesses, and > > natlangwordvotes. natlangwordbestguesses has to be handled > > specially there, as it shouldn't end up with two identical rows > > (identical across all 3 fields); that shouldn't be possible given > > that manipulation, but check anyway. > > Is PHP okay? All the database programming I've done is in PHP, > except a tiny bit that's in bash. It's better than it not getting done. :) You will want to read http://tympanus.net/codrops/2009/08/31/solving-php-mysql-utf-8-issues/ and http://stackoverflow.com/questions/60174/how-to-prevent-sql-injection-in-php , noting that the database postgres, not MySQL. > Can I (or anyone else who's going to do this) have a copy of the > database to practice on? I don't want to delete from the live > database a word by mistake. Except for the users table you certainly can, yes. http://users.digitalkingdom.org/~rlpowell/media/public/jbovlaste.20130326.sql.gz If you need a scrubbed users table, let me know. The problems are mostly in Russian, due to an import script I fucked up. You can see the issue thuswise: select word,meaning,langid from natlangwords where word in (select word from natlangwords group by word, meaning, langid having count(*) > 1) order by langid; That should never, ever happen. There are also a bunch of natlangwords, some related to this bug and some not, which aren't in use at all. -Robin -- http://intelligence.org/ : Our last, best hope for a fantastic future. .i ko na cpedu lo nu stidi vau loi jbopre .i danfu lu na go'i li'u .e lu go'i li'u .i ji'a go'i lu na'e go'i li'u .e lu go'i na'i li'u .e lu no'e go'i li'u .e lu to'e go'i li'u .e lu lo mamta be do cu sofybakni li'u -- You received this message because you are subscribed to the Google Groups "lojban" group. To unsubscribe from this group and stop receiving emails from it, send an email to lojban+unsubscribe@googlegroups.com. To post to this group, send email to lojban@googlegroups.com. Visit this group at http://groups.google.com/group/lojban?hl=en. For more options, visit https://groups.google.com/groups/opt_out.