ppoi
ppoi(ish)
This is a library that uses machine learning to make "looks like X" judgments in situations where you cannot write down the condition of an if statement explicitly, but you can easily provide positive and negative examples.
Since the main script is __init__.py, the checked-out directory can be imported directly as a Python module, and you can rename the directory to something descriptive. For example, if you want it to judge whether something looks like a keyword, you can do mv ppoi keyword and then write:
code:python
if keyword.ppoi(something):
    print("{} is keyword-ppoi".format(something))
(I later decided that import keyword is a bad name, so I added to_bool; I think keyword_ppoi.to_bool(something) reads better.)
__init__.py can also be executed directly; you run it with --initialize first.
The inputs to be judged must be placed in unknowns.txt before execution.
(On reflection, this shouldn't be a hard requirement, so I'll change it in a future version.)
(It also feels like providing a separate script for execution would be better than running __init__.py directly.)
Sample data for the keyword-or-not task is included in samples. The file contains one item per line: examples of the data to be judged. It was taken from Scrapbox, so most items are keywords, but some are not, because square brackets appearing inside source code were mistakenly treated as links.
code::
Elimination of Genus
self-introduction
0
voice input
\r\n
${currentPageTitle}
[ 0.67869952, 0.35340645, -0.37436676, 0.50602025, 0.13514392
Run cp samples/unknowns.txt . and then python3 __init__.py --initialize. You will be asked to enter at least one "keyword-like" example and at least one "non-keyword-like" example. (Python 3, NumPy, and scikit-learn are assumed to be installed.)
code::
Enter at least one positive examples (empty to exit)
Enter examples as appropriate:
code::
Enter at least one positive examples (empty to exit)
Elimination of Attributes
Self Introduction
Enter at least one negative examples (empty to exit)
${currentPageTitle}
[ 0.67869952, 0.35340645, -0.37436676, 0.50602025, 0.13514392
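Under the hood, the prompting loop could be as simple as the following sketch. This is not ppoi's actual source; that negative examples land in negative.txt is my assumption, mirroring the positive.txt mentioned later on this page.
code:python
def read_examples(prompt, filename):
    """Prompt until an empty line, requiring at least one example,
    then save the examples one per line."""
    print(prompt)
    examples = []
    while True:
        line = input().strip()
        if not line:
            if examples:
                break
            print(prompt)  # insist on at least one example
            continue
        examples.append(line)
    with open(filename, "w") as f:
        f.write("\n".join(examples) + "\n")

read_examples("Enter at least one positive examples (empty to exit)", "positive.txt")
read_examples("Enter at least one negative examples (empty to exit)", "negative.txt")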
Then, based on the examples you gave, it displays the five most keyword-like items ("BEST 5"), the five least keyword-like items ("WORST 5"), and the items it is least sure about ("LESS CONFIDENT").
code::
BEST 5
Good ideas inspire those around you to start growing on their own: 0.7077
Nature of freedom: 0.7077
Responsible for their own productivity: 0.7077
The less confident people are, the more they speak ill of others: 0.7077
There are two types of genus elimination: 0.7077
WORST 5
Elemwise{minimum,no_inplace}.0, Subtensor{:int64:}.0, Subtensor{:int64:}.0, Subtensor{:int64:}.0, Subtensor{:int64:}.0, [* IncSubtensor{Set;:int64:}.0: 0.2230
LESS CONFIDENT
kintone migration tips for people used to Facebook: 0.4967
03 1 Structuring Fragmentary Information_Frontal and Actual KJ Method Flow: 0.5056
03 2 Structuring Fragmentary Information_KJ Method Practice: 0.5056
Real numbers between 0 and 1 to {0, 1}: 0.5056
Supplement to DB Press article: 0.5056
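As a rough illustration of how such a listing can be produced (not ppoi's actual code; it assumes a scikit-learn classifier with predict_proba and precomputed feature vectors):
code:python
import numpy as np

def describe(model, lines, features, n=5):
    """Rank lines by the model's positive-class probability."""
    scores = model.predict_proba(features)[:, 1]  # P(looks like a keyword)
    order = np.argsort(scores)
    print("BEST %d" % n)
    for i in order[::-1][:n]:
        print("%s: %.4f" % (lines[i], scores[i]))
    print("WORST %d" % n)
    for i in order[:n]:
        print("%s: %.4f" % (lines[i], scores[i]))
    print("LESS CONFIDENT")
    for i in np.argsort(np.abs(scores - 0.5))[:n]:  # closest to 0.5
        print("%s: %.4f" % (lines[i], scores[i]))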
If you are satisfied with the results, you can start using it right away.
code::
In [3]: ppoi.ppoi("keyword extraction")
If you are not satisfied, you need to add more training data. You don't have to edit the files manually if you use the "interactive active learning" explained later, but this time let's paste all five of the LESS CONFIDENT items into positive.txt.
After changing the data, retrain with python3 __init__.py --learn. This option just retrains without displaying anything, which makes for a dull explanation, so let's show the analysis again with python3 __init__.py --describe.
Here are the results:
code::
BEST 5
"Biologically evolved software will sometimes get sick like a cold, and if you leave it alone, it will go away. Now it's just a matter of being able to properly tie requirement specifications to death. : 0.9511
The good thing about a corporation is that the necessary condition of profit exists. In other words, the ability to go bankrupt is inherent. The function of bankruptcy is the best aspect of the free enterprise system. : 0.9231
Aggregation of organizations connected by people belonging to multiple organizations: 0.9125
"Even with equal rules (with seemingly high equality of results), there will be disparities, surprise," the argument is implicit in the parentheses, so isn't the tweak "If the WTA has low equality of results, there will be disparities" misplaced? The story is that. : 0.9081
(2) A private document shall be presumed to be duly executed when it is signed or sealed by the principal or his/her agent. : 0.9025
WORST 5
0.45792096, 0.20519401, 0.77490418, -0.08403411, -0.37505412: 0.2266
0.50602025, -0.08403411, 0.68316922, -0.34559589, 0.38823327: 0.2266
-0.20123254, 0.07652205, 0.50049221, 0.38823327, 0.74325791: 0.2266
-0.26671759, 0.89310368, -0.08178549, -0.34559589, 0.07142937: 0.2266
-0.37436676, 0.77490418, 0.04681577, -0.08178549, 0.50049221: 0.2266
LESS CONFIDENT
Feeling of Knowing in Memory and Problem Solving: 0.5010
2018-03: 0.5013
harajune: 0.4984
the structure of consumption technology: 0.4984
Scrapbox Drinkup 20180810: 0.5019
The sentences beginning with " that made it into the BEST 5 this time are inline quotation syntax; they can be filtered out by the clear rule that anything beginning with " is not a keyword, so it is better to handle them outside ppoi. If something can be written as explicit logic, write it as logic.
In cases where no such clear rule applies and adding more data does not separate the classes well, you need to add features. _make_features is the function that takes a single line of text and returns an np.array, so that is what you would modify.
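For illustration, a _make_features-style function might look like the sketch below; these particular features are my guesses, not the ones ppoi actually uses:
code:python
import numpy as np

def _make_features(line):
    """Turn one line of text into a fixed-length feature vector."""
    n = max(len(line), 1)
    return np.array([
        len(line),                                  # raw length
        sum(c.isdigit() for c in line) / n,         # ratio of digits
        sum(c in ".,:;{}[]()" for c in line) / n,   # ratio of punctuation
        line.count(" ") + 1,                        # rough word count
        int(line.startswith('"')),                  # Scrapbox inline quotation
    ], dtype=float)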
Right now it assumes the input is a string containing no newlines; I'll deal with other kinds of input when I actually need to handle them. Until then, encoding such input as JSON is good enough.
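For example, a multi-line string can be collapsed into a single newline-free line with the standard library:
code:python
import json

line = json.dumps("first line\nsecond line")
# -> '"first line\\nsecond line"', a single line; recover it with json.loads(line)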
Interactive active learning
When started with --interactive, you can train interactively. Items are shown in order starting from the LESS CONFIDENT ones, and when you answer positive or negative, the answer is appended to the data file. This is so-called active learning.
In addition to positive and negative, there is also neutral. If you cannot tell at a glance whether an item is positive or negative, put it all in neutral.
code::
$ python3 __init__.py --interactive
Feeling of Knowing in Memory and Problem Solving: 0.5010
negative(z), neutral(x), positive(c), quit(q)>c
Peter F. Drucker: 0.5023
negative(z), neutral(x), positive(c), quit(q)>c
virtual: 0.5014
negative(z), neutral(x), positive(c), quit(q)>x
negative(z), neutral(x), positive(c), quit(q)>z
2015: 0.4997
negative(z), neutral(x), positive(c), quit(q)>c
Lancaster1966: 0.4972
negative(z), neutral(x), positive(c), quit(q)>c
negative(z), neutral(x), positive(c), quit(q)>z
Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems: 0.4994
negative(z), neutral(x), positive(c), quit(q)>c
2018-03: 0.4981
negative(z), neutral(x), positive(c), quit(q)>c
data.Name=="1299c1b7a9e0c2bf41af69c449464a49": 0.5034
negative(z), neutral(x), positive(c), quit(q)>z
Habitica: 0.5075
negative(z), neutral(x), positive(c), quit(q)>c
Miller1956: 0.5078
negative(z), neutral(x), positive(c), quit(q)>c
72, 101, 108, 108, 111, 44, 32, 119, 111, 114, 108, 100, 33: 0.5032
negative(z), neutral(x), positive(c), quit(q)>z
In the current implementation, every answer updates the data file, which is then reread and the model retrained. Because teaching it one example also lets it decide similar cases, this avoids being asked near-duplicates over and over, but it gets slower and slower as the data grows.
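That loop could look roughly like this (a sketch, not ppoi's source; load_data, train, and make_features stand in for assumed helpers that reread the data files, fit a classifier, and build feature vectors):
code:python
import numpy as np

FILES = {"z": "negative.txt", "x": "neutral.txt", "c": "positive.txt"}

def interactive(load_data, train, make_features, unknowns):
    """Ask about the most uncertain item, record the answer, retrain."""
    while unknowns:
        model = train(load_data())  # reread everything and retrain (slows down as data grows)
        scores = model.predict_proba(
            np.array([make_features(u) for u in unknowns]))[:, 1]
        i = int(np.argmin(np.abs(scores - 0.5)))  # most uncertain item
        print("%s: %.4f" % (unknowns[i], scores[i]))
        answer = input("negative(z), neutral(x), positive(c), quit(q)>")
        if answer == "q":
            break
        if answer in FILES:
            with open(FILES[answer], "a") as f:
                f.write(unknowns.pop(i) + "\n")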
To do
Cache the trained model (--learn was created for exactly this); a sketch appears after this list.
Right now, if you simply import the library, it still trains and builds the model at the first judgment.
Retraining every time is fine in the early data-creation phase, but once the data has grown, or you are merely using the model, retraining on every run is wasteful, so it is better to retrain explicitly with --learn.
On the other hand, in the early stages it is a burden for a human to keep deciding whether retraining is needed, so it is better for the library to do it on its own.
Write an explanation of how to improve the features.
--initialize currently starts by asking "Enter at least one positive examples (empty to exit)", but I would like a version that asks you to pick examples from unknowns.txt instead.
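For the model-cache item above, a minimal sketch using pickle (the file name and helper names are assumptions, not ppoi's implementation; --learn would then simply rebuild this cache):
code:python
import os
import pickle

MODEL_CACHE = "model.pickle"

def get_model(load_data, train):
    """Return the cached model if one exists; otherwise train and cache it."""
    if os.path.exists(MODEL_CACHE):
        with open(MODEL_CACHE, "rb") as f:
            return pickle.load(f)
    model = train(load_data())
    with open(MODEL_CACHE, "wb") as f:
        pickle.dump(model, f)
    return model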
---
This page is auto-translated from /nishio/ppoi. If you see something interesting but the auto-translated English is not good enough to understand, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.