Index-building conversation 2024-03-21

nishio — 03/21/2024 9:19 PM

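Hello everyone, especially those who find contributing via GitHub difficult. Here is an easy way to contribute using Google Spreadsheet.

We are collecting keywords to create an index. Would you like to contribute? All you need to do is add any word you think is important/salient to this spreadsheet whenever you come across one: https://docs.google.com/spreadsheets/d/1gmyjFbErt_CW8-qLKChSpciLlCDGUhLriYFov0HO3qA/edit#gid=0

I have implemented a script that checks keyword occurrences based on the spreadsheet data and writes them to various files. Your contributions to the index will be written to this file: https://github.com/pluralitybook/plurality/blob/main/scripts/index/contributors.tsv

The statistics on the second sheet show which sections have not yet received much attention.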

https://scrapbox.io/files/65ff990a78ce4400259cd4cb.png

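gnomevan — Yesterday at 4:59 AM

Two clarifying questions on this task: 1. Once an item has been added to the index, do we need to add a second entry when we find it in another chapter, or will the code handle that? 2. Should names of people be entered as Last, First?

nishio — Yesterday at 5:48 AM

A1: The code can detect occurrences in other chapters, so you don't need to record every occurrence manually. In addition, even if several people enter records for several chapters, the code can merge them without any issues. So you don't need to worry about whether occurrences in other chapters have been recorded. In conclusion: feel free to proceed without worries.

A2: As a first step, using the exact strings that appear in the manuscript is sufficient. This is not the process of creating the final index but rather of gathering candidates for inclusion in it. Once they are gathered, Glen and the publisher will decide on the final format. That deadline is still ahead.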

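Jason Entenmann — Yesterday at 5:54 AM

@nishio - Thanks for your lead on the indexing. What date are you currently targeting to complete the indexing?

nishio — Yesterday at 6:18 AM

Glen said the following:

The index is due roughly April 1. The words for the index need to be in a week before that...we will need to use an LLM to do the actual indexing.

Jason Entenmann — Yesterday at 6:44 AM

Do you have any thoughts on how the LLM will need to be used? Is it basically a word search against each page, flagging the first place each word appears?

GlenWeyl — Yesterday at 6:57 AM

We basically want the LLM to find every occurrence of the phrase, or something very similar/equivalent, and list all the pages.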

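nishio — Yesterday at 3:26 PM

I've updated the repository with the latest spreadsheet data. Next, I'll review keywords that exhibit irregular behavior.

nishio — Yesterday at 3:38 PM

I calculated keyword occurrences per 10k characters. The lowest sections are: 3-1, 4-1, 4-3, 4-4. I heard cFQ is handling 3-1.

cFQ — Yesterday at 7:20 PM

Section 3-1 is done, and Section 4-1 is currently in progress.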

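GlenWeyl — Yesterday at 10:45 PM

Just a quick note: the total number of words we should aim for is about 1500-2000. You are within that range, but we should start thinking soon about making sure we do not include unnecessary or redundant terms.

cFQ — Today at 7:21 AM

Sections 4-1 and 7-1 are done. I tried to avoid unnecessary words. I will continue reviewing the remaining sections that have few index entries.

cFQ — Today at 7:25 AM

According to Nishio, the data contains 1,474 entries after merging duplicates, not yet counting 4-1 and 7-1. He said we seem to be on track for the target if we continue like this.

GlenWeyl — Today at 7:43 AM

Sounds perfect.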

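cFQ — Today at 9:16 AM

I have also finished sections 4-3, 4-4, and 4-5. I would also like to check section 7-0, which has few index entries, but I have to leave now for an errand. I may not make the deadline, but I can work from tonight, Japan time.

GlenWeyl — Today at 10:02 AM

I'll be here with you all for these final 3 hours.

Thanks for going strong.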

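nishio — Today at 10:14 AM

I'm going to update the index in the repository now.

We have gathered 1,820 index candidates so far, thank you! Seven upper/lowercase discrepancies were detected, and I'll take care of those.

GlenWeyl — Today at 10:26 AM

Amazing work, everyone.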

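nishio — Today at 10:36 AM

The concept of "dao" in section 4-5 and Distributed Autonomous Organizations (DAOs) are distinct concepts, so I decided not to merge them. I'll leave a note about this in the final version of the index.

GlenWeyl — Today at 10:36 AM

There seems to be some wordplay here.

I think the ideal is for them to be adjacent, but please do not merge them.

GlenWeyl — Today at 10:26 AM

And it sounds like once we have page numbers, you have a pipeline to associate them with the list, right?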

nishio — Today at 10:42 AM

Not yet. I'll break this down into finer-grained tasks after I finish the current one.

GlenWeyl — Today at 10:42 AM

That part should be the easiest. And it is due closer to April 7.

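nishio — Today at 11:07 AM

@GlenWeyl

There is code available that matches each keyword against the text of each chapter, ignoring case, and identifies the chapters where it appears.

However, it is still unclear how pagination will be handled and how the resulting data will be obtained. If the data is provided as text, it's easy; if it's a PDF, it takes a bit more effort but is still not especially difficult.

The current algorithm prioritizes not missing existing keywords, so it might detect, for example, "Taiwan" appearing in 20 different chapters. If "Taiwan" appears on 100 pages, you wouldn't want to include all of them in the index of a printed book. This is where the specification of the algorithm is ambiguous.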

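GlenWeyl — Today at 11:07 AM

It will probably be a PDF, but OCR should address the issue. For the small number of cases with a very large number of occurrences, we will have to look online at how people handle it.

nishio — Today at 11:09 AM

An output is also generated for observing which keywords appear frequently.

Among these, ROC and BERT, for instance, are likely to match as part of other words because case is ignored. Technically, those can be excluded. But what about other cases like Privacy or Ownership?

GlenWeyl — Today at 11:13 AM

You can avoid that match by putting a space before and after.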

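nishio — Today at 11:19 AM

Some words occur frequently, but not as part of other words: Privacy, Ownership, Democracy, Market, Legitimacy...

It's unclear how to handle these cases.

GlenWeyl — Today at 11:19 AM

I think the ideal way would be to include only the cases where we are defining or refining them in a unique or novel way. But including all such cases is challenging. Most uses of these terms are generic, but there are about 4-5 times each where we are saying something that turns the perspective.

nishio — Today at 11:23 AM

Yes. I just want to create a common understanding. This is not the easiest part; it is a challenging one.

GlenWeyl — Today at 11:25 AM

If it weren't hard, we wouldn't need Plurality!

Our job is to invent technologies for common understanding.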

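nishio — Today at 11:28 AM

If I understand correctly, the actual deadline for the paper book is April 7. We need finer-grained deadlines for the indexing.

GlenWeyl — Today at 11:33 AM

The deadline for the printed book is April 15.

The deadline for the index numbers is April 7.

Basically, once we get the paginated version, you will have 2-3 days from that point to produce the numbers.

I'm not sure when that will happen, but probably sometime during the first full week of April (i.e., the week after this coming one).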

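nishio — Today at 11:50 AM

One relatively feasible option might be to search only within the chapters where humans have recorded occurrences of those keywords, and to include only the page of the first occurrence in the index. This would require creating data that maps pages to chapters, but otherwise it seems achievable.

GlenWeyl — Today at 11:51 AM

That makes sense.

nishio - Today at 12:29 PM

the first week in April (i.e., the week after this coming one)

You mean 4/1-4/7, right? (On my calendar weeks start on Sunday, so it would mean 4/7-4/13. An interesting bit of cultural diversity!)

GlenWeyl - Today at 12:30 PM

Yes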

Discord

nishio — 03/21/2024 9:19 PM

Hello everyone, especially those who find contributing via GitHub difficult. I'd like to inform you about an easy way to contribute using Google Spreadsheet.

We are collecting keywords to create an index. Would you like to contribute to this? All you need to do is anytime you see a word that you think is important/salient, dump it into this spreadsheet: https://docs.google.com/spreadsheets/d/1gmyjFbErt_CW8-qLKChSpciLlCDGUhLriYFov0HO3qA/edit#gid=0

I have implemented a script that checks keyword occurrences based on the spreadsheet data and writes them to various files. Your contributions to the index will be written to this file: https://github.com/pluralitybook/plurality/blob/main/scripts/index/contributors.tsv

You can see the statistics on the second sheet to understand which sections have received less attention.

https://scrapbox.io/files/65ff990a78ce4400259cd4cb.png

gnomevan — Yesterday at 4:59 AM

Two clarifying questions on this task: 1. Once an item is in the index, is there a need to add a second entry for it when we find it in another chapter, or will that be done by code? 2. For names of people, should they be entered as Last, First?

nishio — Yesterday at 5:48 AM

A1: The code can detect occurrences in other chapters. Therefore, you don't need to manually record all occurrences. Additionally, even if one or more people input records for multiple chapters, the code can merge them without any issues. So, you don't need to worry about whether occurrences in other chapters are recorded or not. In conclusion: Feel free to proceed without worries.
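The merge described in A1 can be pictured roughly as follows. This is only a sketch, not the repository's actual script: the row layout `(keyword, section, contributor)` and the function name `merge_records` are assumptions made for illustration.

```python
from collections import defaultdict

def merge_records(rows):
    """Collapse (keyword, section, contributor) rows into one entry per keyword.

    The row layout is hypothetical; the real spreadsheet columns may differ.
    """
    merged = defaultdict(lambda: {"sections": set(), "contributors": set()})
    for keyword, section, contributor in rows:
        entry = merged[keyword.strip()]
        entry["sections"].add(section)
        entry["contributors"].add(contributor)
    return dict(merged)

# Two people record "Taiwan", partly for the same section.
rows = [
    ("Taiwan", "2-1", "alice"),
    ("Taiwan", "3-1", "bob"),
    ("Taiwan", "2-1", "bob"),
    ("quadratic voting", "5-6", "alice"),
]
merged = merge_records(rows)
```

Because sections and contributors are stored as sets, repeated or overlapping entries do no harm, which is why contributors do not need to check what others have already recorded.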

A2: As a first step, using the exact strings appearing in the manuscript is sufficient. This is not the process of creating the final index but rather gathering candidates for inclusion in the index. After gathering them, Glen and the publisher will decide on the final format. The deadline for that is still ahead.

Jason Entenmann — Yesterday at 5:54 AM

@nishio - Thanks for your lead on the indexing. What date are you currently targeting to have the index exercise complete?

nishio — Yesterday at 6:18 AM

Glen said as follows

The index is due roughly April 1. The words for the index should be in a week before that...we will need to use an LLM to do the actual indexing once we have the index words.

Jason Entenmann — Yesterday at 6:44 AM

Do you have an idea on how you'll need to use an LLM for the indexing? Is it just basically word search lookup against each page, and first time appearance flags or something?

GlenWeyl — Yesterday at 6:57 AM

We basically just want the LLM to find every occurrence of the phrase or something very similar/equivalent

and list all the pages

nishio — Yesterday at 3:26 PM

I've just updated the repository with the latest spreadsheet data. Now, I'll review keywords exhibiting irregular behavior.

nishio — Yesterday at 3:38 PM

I just calculated keyword occurrence per 10k characters. Lowest sections are: 3-1, 4-1, 4-3, 4-4. I heard cFQ is doing 3-1.
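A density figure like this could be computed along the following lines. This is a minimal sketch under assumptions: the inputs (a dict of section texts and a dict of per-section keyword counts) and the function names are hypothetical, and the actual script may count occurrences differently.

```python
def keywords_per_10k(section_texts, keyword_counts):
    """Return keyword occurrences per 10,000 characters for each section."""
    density = {}
    for section, text in section_texts.items():
        if not text:
            continue  # skip empty sections to avoid dividing by zero
        density[section] = keyword_counts.get(section, 0) * 10_000 / len(text)
    return density

def least_covered(density, n=4):
    """Return the n sections with the lowest keyword density."""
    return sorted(density, key=density.get)[:n]

# Toy data: placeholder texts of known length.
section_texts = {"3-1": "x" * 20_000, "4-1": "x" * 10_000, "5-1": "x" * 5_000}
keyword_counts = {"3-1": 10, "4-1": 20, "5-1": 30}
density = keywords_per_10k(section_texts, keyword_counts)
lowest = least_covered(density, n=2)
```

Normalizing by section length keeps long sections from looking well covered simply because they contain more text.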

cFQ — Yesterday at 7:20 PM

Section 3-1 is done, and Section 4-1 is currently in progress.

GlenWeyl — Yesterday at 10:45 PM

Just a quick note: the number of total words we should aim for is about 1500-2000. You are right in that range, but should think soon about ensuring that we do not include unnecessary or redundant terms

cFQ — Today at 7:21 AM

Sections 4-1 and 7-1 are done. I have tried to avoid unnecessary words. I will continue to review the remaining sections with fewer indexes.

cFQ — Today at 7:25 AM

According to Nishio, the data contains 1,474 entries after merging duplicates, not yet counting 4-1 and 7-1. He said we seem to be on track for the target if we proceed as we are.

GlenWeyl — Today at 7:43 AM

sounds perfect

cFQ — Today at 9:16 AM

I have also finished sections 4-3, 4-4, and 4-5. I would also like to check section 7-0 as it has a small number of indexes, but I have to leave now for an errand. I may not be able to meet the deadline, but I will be available to work from tonight in Japan time.

GlenWeyl — Today at 10:02 AM

Will be here with you folks in this final 3 hours

thanks for going strong

nishio — Today at 10:14 AM

I'm going to update the index on the repository now

nishio — Today at 10:22 AM

We've gathered 1,820 index candidates at the moment, thank you! There are 7 cases of upper/lowercase discrepancies detected, and I'll take care of those.
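One way such upper/lowercase discrepancies might be detected is to group candidates by their case-folded form. This is a sketch only; the function name `case_discrepancies` is hypothetical and the repository's check may work differently.

```python
from collections import defaultdict

def case_discrepancies(keywords):
    """Group keywords that differ only by letter case."""
    groups = defaultdict(set)
    for kw in keywords:
        groups[kw.casefold()].add(kw)
    # Keep only groups where more than one spelling was recorded.
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}

found = case_discrepancies(["Blockchain", "blockchain", "DAO", "Taiwan"])
```

Flagged groups still need human review, since variants that differ only in case can be genuinely distinct terms.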

GlenWeyl — Today at 10:26 AM

Amazing work you all

and sounds like once we have pagination you have a pipeline to associate a list of page numbers, right?

nishio — Today at 10:36 AM

The ancient concept of 'dao' in section 4-5 and Distributed Autonomous Organizations (DAOs) are distinct concepts, so I've decided not to merge them. I'll leave a note about this in the final version of the index.

GlenWeyl — Today at 10:36 AM

obviously there is a linguistic play here

so I think the ideal notion is that they be adjacent but not merged

nishio — Today at 10:42 AM

Not yet. Let me break down further tasks after I finish the current task.

GlenWeyl — Today at 10:42 AM

should be the easiest part

and due closer to April 7

nishio — Today at 11:07 AM

@GlenWeyl

There is code available to match each keyword with strings in each chapter, ignoring case sensitivity, and identify the chapters where they appear.

However, I'm not sure about how pagination is handled or how the resulting data is obtained.

If the data is provided in text format, it's easy;

if it's in PDF format, it requires a bit more effort but is still not overly difficult.

Caution: The current algorithm prioritizes not overlooking existing keywords, which may result in detecting, for example, "Taiwan" appearing in 20 different chapters. Suppose "Taiwan" appears on 100 pages; in such a case, you wouldn't want to include all occurrences in the index of a printed book. Here lies the ambiguity in the specification of the algorithm.
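The chapter-level matching described in this message can be sketched roughly as below. This is an illustration under assumptions, not the repository code: `chapters_containing` and the sample chapter texts are made up.

```python
def chapters_containing(keyword, chapters):
    """Return the chapter ids whose text contains the keyword, ignoring case."""
    needle = keyword.casefold()
    return [name for name, text in chapters.items() if needle in text.casefold()]

# Invented sample texts keyed by section id.
chapters = {
    "2-1": "Taiwan has a vibrant civic tech scene.",
    "3-1": "The process matters.",
    "4-5": "Visitors to TAIWAN often remark on this.",
}
taiwan_chapters = chapters_containing("taiwan", chapters)
roc_chapters = chapters_containing("ROC", chapters)
```

Plain substring matching favors recall: it never misses a recorded keyword, but it also reports "ROC" in a chapter that only contains the word "process".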

GlenWeyl — Today at 11:07 AM

probably will be pdf...but OCR should address the issue

Probably for the small number of cases with a very large number of occurrences we will have to look online at how people handle it

nishio — Today at 11:09 AM

An output is also generated to observe which keywords appear frequently.

https://github.com/pluralitybook/plurality/blob/100ed8f801ad7a493b4fdca57bfc51810d265f00/scripts/index/too_many_occurrence.tsv

Among these, for instance, ROC and BERT are likely to match as part of other words due to case insensitivity. Technically, they can be removed. However, what about others like Privacy or Ownership?

GlenWeyl — Today at 11:13 AM

You can overcome that match by putting a space before and after
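A variant of the "space before and after" idea is to require that the keyword not be surrounded by word characters, which also handles keywords followed by punctuation. The sketch below uses Python's `re` module; the helper name `whole_word_pattern` and the sample sentence are assumptions for illustration.

```python
import re

def whole_word_pattern(keyword):
    """Compile a case-insensitive pattern that matches the keyword only as a whole word."""
    # Lookarounds reject matches embedded in a longer word, such as "roc" inside "process".
    return re.compile(r"(?<!\w)" + re.escape(keyword) + r"(?!\w)", re.IGNORECASE)

text = "The process in the ROC (Taiwan) differs. BERT is a model; Robert is a name."
roc_hits = whole_word_pattern("ROC").findall(text)
bert_hits = whole_word_pattern("BERT").findall(text)
```

Unlike padding the keyword with literal spaces, this still matches "ROC," or "ROC." at the end of a clause.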

nishio — Today at 11:19 AM

Some words occur frequently but not as part of other words, such as Privacy, Ownership, Democracy, Market, Legitimacy ...

It's unclear how to handle such cases.

GlenWeyl — Today at 11:19 AM

I mean I think the ideal way

would be to only include the cases where we are defining or refining them in a unique or novel way

and not all the occurrences

but that is challenging

for most of those terms

most uses are the generic uses

but there are like 4-5 times each where we are saying something that turns the perspective

nishio — Today at 11:23 AM

Yes. I just want to create a common understanding. It is not the easiest part; it is a challenging one.

GlenWeyl — Today at 11:25 AM

yup, if it weren't hard, we wouldn't need Plurality!

Our job is to invent technologies for common understanding

nishio — Today at 11:28 AM

If I understand correctly, April 7 is the real deadline for the printed book. We need finer-grained deadlines for the indexing.

GlenWeyl — Today at 11:33 AM

April 15 is the deadline for the printed book

April 7 is for the index numbers

probably

basically

we will get a paginated version

and then you will have 2-3 days from that time

to produce the numbers

not sure when that will happen

but likely sometime during the first full week in April (viz. the week after this coming one)

nishio — Today at 11:50 AM

One relatively feasible option might be to search only within the chapters where humans have recorded the occurrences of those keywords as locations, and include only the first occurrence page in the index. This would require creating data for mapping pages to chapters, but otherwise seems achievable.
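The option proposed here could look something like the sketch below. It assumes "first occurrence" means the first matching page within each human-recorded chapter, which is one possible reading of the message; `first_pages`, `pages`, and `page_to_chapter` are hypothetical names and data structures.

```python
def first_pages(keyword, recorded_chapters, pages, page_to_chapter):
    """Find the first page mentioning the keyword in each human-recorded chapter.

    pages: dict mapping page number -> page text (assumed to come from the paginated PDF).
    page_to_chapter: dict mapping page number -> chapter id (the mapping data to be created).
    """
    needle = keyword.casefold()
    result = {}
    for page in sorted(pages):
        chapter = page_to_chapter.get(page)
        # Skip chapters nobody recorded, and chapters that already have a first page.
        if chapter not in recorded_chapters or chapter in result:
            continue
        if needle in pages[page].casefold():
            result[chapter] = page
    return result

pages = {
    1: "Taiwan intro",
    2: "more on Taiwan",
    3: "nothing here",
    4: "Taiwan again",
    5: "taiwan and markets",
}
page_to_chapter = {1: "2-1", 2: "2-1", 3: "3-1", 4: "3-1", 5: "4-5"}
index_pages = first_pages("Taiwan", {"2-1", "3-1"}, pages, page_to_chapter)
```

Restricting the search to recorded chapters keeps a frequent word like "Taiwan" from producing a page number for every chapter it happens to appear in.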

GlenWeyl — Today at 11:51 AM

seems sensible