Reducing false positives: advanced tagging model
When identifying Multi-Word Units (MWUs) in an input text, it is sometimes necessary to set aside strings that are lexically identical to an MWU but are not actually one. One example is "used to", which can be a modal verb (a genuine MWU) or part of a passive construction (a false positive).
To remove such false positives ("used to" as part of a passive), it is necessary to use contextual information.
There are several approaches to this. They include, but are not limited to:
Checking the surrounding context in terms of lemmas. For example, most instances of "used to" in a passive structure follow a copula verb, so we can narrow down the candidates by checking the preceding words.
Pros: This is simple to implement.
Cons: The copula verb does not always occur immediately before "used to" (e.g., "That machine was often used to manufacture the tool.").
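The lemma check described above could be sketched roughly as follows. This is only an illustration: the (form, lemma) token format and the function names find_used_to and is_passive_by_preceding_lemma are hypothetical, and the lemmas are typed in by hand rather than produced by a real lemmatizer.

```python
# Hypothetical input format: each token is a (surface form, lemma) pair.
BE_LEMMA = "be"

def find_used_to(tokens):
    """Return every index i where tokens[i], tokens[i + 1] are 'used', 'to'."""
    return [
        i for i in range(len(tokens) - 1)
        if tokens[i][0].lower() == "used" and tokens[i + 1][0].lower() == "to"
    ]

def is_passive_by_preceding_lemma(tokens, i):
    """Flag 'used to' at index i as passive if the word right before it has lemma 'be'."""
    return i > 0 and tokens[i - 1][1] == BE_LEMMA

# Hand-lemmatized example sentences (assumed, not tagger output).
modal = [("He", "he"), ("used", "use"), ("to", "to"),
         ("play", "play"), ("tennis", "tennis")]
passive = [("It", "it"), ("was", "be"), ("used", "use"), ("to", "to"),
           ("cut", "cut"), ("wood", "wood")]
intervening = [("That", "that"), ("machine", "machine"), ("was", "be"),
               ("often", "often"), ("used", "use"), ("to", "to"),
               ("manufacture", "manufacture"), ("the", "the"), ("tool", "tool")]

for sentence in (modal, passive, intervening):
    for i in find_used_to(sentence):
        print(" ".join(form for form, _ in sentence),
              "->", is_passive_by_preceding_lemma(sentence, i))
```

The third sentence is the one from the Cons line: the adverb "often" sits between the copula and "used to", so this check does not flag it as passive and the false positive survives.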
Extracting the syntactic dependencies and checking the relationships.
For example, if "used to" is a modal verb, it should govern only the subject and the lexical verb that comes after "to". If, on the other hand, "used to" is part of a passive structure, it should additionally govern a copula verb. Identifying the syntactic relations should therefore help reduce false positives in identifying MWUs.
Pros: The relationships can be identified even when there are intervening words, which overcomes the shortcoming of the first approach.
Cons:
As we might expect, this approach is more complicated (although entirely feasible).
For instance, we need to go through all the MWUs and their potential false positives manually to find "rules" that can distinguish between the usages.
The documented rules need to be exhaustive. Their precision at this stage will directly affect the ability to differentiate the MWUs.
Some coding work is also needed, which is somewhat time consuming.
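To give a sense of what one such rule might look like in code, here is a minimal sketch of the dependency check for "used to". It makes several assumptions: the Token structure and the function names are hypothetical, the parses are written by hand instead of coming from a parser, and the relation labels follow Universal Dependencies naming (for example aux:pass), which may differ from the labels of the parser actually used.

```python
from typing import NamedTuple

class Token(NamedTuple):
    form: str
    lemma: str
    head: int    # index of the governing token, -1 for the root
    deprel: str  # dependency label (UD-style names, assumed)

def children(sent, i):
    """Return the tokens directly governed by the token at index i."""
    return [tok for tok in sent if tok.head == i]

def classify_used_to(sent, i):
    """Classify 'used' at index i as 'passive' or 'modal' from its dependents."""
    for child in children(sent, i):
        # Rule from the text: a passive 'used' additionally governs a form of 'be'.
        if child.lemma == "be" and child.deprel in {"aux:pass", "aux", "cop"}:
            return "passive"
    return "modal"

# Hand-annotated parses (assumed, not parser output).
passive_sent = [
    Token("That", "that", 1, "det"),
    Token("machine", "machine", 4, "nsubj:pass"),
    Token("was", "be", 4, "aux:pass"),
    Token("often", "often", 4, "advmod"),
    Token("used", "use", -1, "root"),
    Token("to", "to", 6, "mark"),
    Token("manufacture", "manufacture", 4, "advcl"),
    Token("the", "the", 8, "det"),
    Token("tool", "tool", 6, "obj"),
]
modal_sent = [
    Token("He", "he", 1, "nsubj"),
    Token("used", "use", -1, "root"),
    Token("to", "to", 3, "mark"),
    Token("play", "play", 1, "xcomp"),
    Token("tennis", "tennis", 3, "obj"),
]

print(classify_used_to(passive_sent, 4))
print(classify_used_to(modal_sent, 1))
```

Because the rule looks at what "used" governs and not at the adjacent word, the intervening adverb "often" no longer matters. Each additional MWU would need its own manually derived rule of this kind.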