If the pattern has been computed from the name given in the output, the score is 1. Recall that every parsed name carries an associated score. When the pattern of an email address is extracted using a parsed name, the \textit{name score} influences the score of the pattern associated with the company: the pattern score is the average of the name scores of the names from which the pattern has been extracted. The computation of this score is illustrated in \cref{fig:pattern_finder_flow}.

\subsection{Prediction score}
If a company uses 3 different patterns distributed as (0.8, 0.1, 0.1) and only 3 email addresses have been analyzed, it is very unlikely to have discovered the 3 …
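
As a rough check of this example, assume (purely for illustration; this assumption is not stated in the text) that the 3 analyzed addresses are independent draws from the distribution (0.8, 0.1, 0.1). Every pattern is then observed only if each address uses a different pattern, which can happen in $3!$ orders:
\[
  P_{\mathrm{all}} = 3! \times 0.8 \times 0.1 \times 0.1 = 0.048 .
\]
Under this assumption, fewer than 5\% of such samples of 3 addresses would reveal every pattern in use.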
The score decreases with the number of different patterns that have been found for a company, since it becomes less likely to find a new one. No statistical analysis has been made. Instead, this concept is approximated with the following idea: if only one pattern is known, we consider that there is a 50\% chance of finding a different one next time. If we know 1 pattern with frequency 2, then the next one will have a 66\% chance of being a new one. If 2 patterns are known, each with frequency 1, then the probability of finding a new one should be lower. This behavior comes from the assumption $A2$ discussed