DF
Document Frequency
Affected by [Document Granularity]
As an extreme example, if every single word is treated as its own document, DF coincides with TF.
A document is often counted as "1" if the word appears in it at least once.
In other words, a step function is applied to the occurrence count.
The occurrence count is what gets thresholded, and it is a value that naturally tends to increase as the number of words in the document increases.
Wouldn't it be better to divide by the number of words to get the probability of occurrence...
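A minimal Python sketch of the notes above, assuming a document is simply a list of word tokens. The names `document_frequency`, `occurrence_probability`, and the sample `docs` are hypothetical, made up for illustration only.

```python
def document_frequency(docs, word, threshold=1):
    """Count the documents in which `word` occurs at least `threshold` times.

    Each document contributes 0 or 1: a step function of the raw count.
    """
    return sum(1 for doc in docs if doc.count(word) >= threshold)


def occurrence_probability(doc, word):
    """Occurrences of `word` divided by the number of words in `doc`."""
    return doc.count(word) / len(doc) if doc else 0.0


# Toy corpus (assumed data): a short document and a long one both contain "a".
docs = [
    ["a", "b", "a"],
    ["b", "c"],
    ["a"] + ["c"] * 9,
]

df_a = document_frequency(docs, "a")                       # threshold of 1
df_a_strict = document_frequency(docs, "a", threshold=2)   # threshold of 2

# Length-normalized alternative: the long document now scores lower.
p_short = occurrence_probability(docs[0], "a")
p_long = occurrence_probability(docs[2], "a")

# Extreme granularity: one word per document makes DF equal to TF.
words = [w for doc in docs for w in doc]
one_word_docs = [[w] for w in words]
tf_a = words.count("a")
df_a_fine = document_frequency(one_word_docs, "a")
```

With the step function, the short and the long document count the same toward DF even though "a" is far rarer in the long one. Dividing by the document length keeps that difference visible.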
---
This page is auto-translated from /nishio/DF. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to share my thoughts with non-Japanese readers.