These experiments were performed for the paper WANDA – a Higher Order Termination Tool, presented at FSCD 2020. All experiments used WANDA version 2.2, with a 60 second timeout.
We performed four series of experiments, three of which are (partially) reported in the paper. In the first, we evaluated WANDA on the Termination Problem Database (TPDB), version 10.6, using several configurations to assess the power of the individual techniques described in the paper. This gives the following results:
Technique | WANDA configuration | #Yes | #No | Perc. | Avg. time (s) |
WANDA without restrictions | ./wanda.exe | 188 | 16 | 78% | 1.14 |
Only rule removal | ./wanda.exe -d nt,dp | 123 | 0 | 47% | 1.13 |
Only rule removal with StarHorpo | ./wanda.exe -d poly,nt,dp | 111 | 0 | 42% | 0.24 |
Only rule removal with Polynomial Interpretations | ./wanda.exe -d horpo,nt,dp | 59 | 0 | 23% | 0.07 |
Only dependency pairs | ./wanda.exe -d rr,nt | 186 | 0 | 71% | 1.02 |
Only static dependency pairs | ./wanda.exe -d dynamic,rr,nt | 152 | 0 | 58% | 0.55 |
Only dynamic dependency pairs | ./wanda.exe -d static,rr,nt | 167 | 0 | 64% | 1.30 |
Full WANDA, but without a first-order tool | ./wanda.exe -i none | 183 | 9 | 74% | 0.90 |
Full WANDA, but without simplifying fake patterns | ./wanda.exe --dontsimplify | 155 | 16 | 66% | 0.75 |
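For reproducibility, a run over one of these configurations can be scripted along the following lines. This is a minimal sketch, not the harness actually used: it assumes the benchmarks are available as individual files (the benchmarks/*.afs path is a placeholder) and that WANDA prints its verdict (YES, NO or MAYBE) on the first line of output, as is conventional for termination tools.

    import glob, subprocess, time

    FLAGS = ["-d", "nt,dp"]   # the "only rule removal" configuration from the table above
    counts = {"YES": 0, "NO": 0, "MAYBE": 0, "TIMEOUT": 0}
    times = []

    for benchmark in sorted(glob.glob("benchmarks/*.afs")):  # placeholder path and extension
        start = time.time()
        try:
            run = subprocess.run(["./wanda.exe", *FLAGS, benchmark],
                                 capture_output=True, text=True, timeout=60)
            verdict = run.stdout.splitlines()[0].strip() if run.stdout else "MAYBE"
            counts[verdict if verdict in counts else "MAYBE"] += 1
        except subprocess.TimeoutExpired:
            counts["TIMEOUT"] += 1
        times.append(time.time() - start)

    print(counts, "average time: %.2f s" % (sum(times) / len(times)))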
The second set of experiments is not included in the paper. It considers the TPDB without the two subdirectories Uncurried_Applicative_11 and Hamana_Kikuchi_18. The reason for this separation is that these two directories largely seem to contain generated benchmarks: the same few rules (mostly fold, rec and map) are repeated many times, coupled with different first-order systems or called in slightly different ways. While it is interesting to see how tools perform on these variations, including them carries a significant risk of skewing the results. We therefore isolated the results on the remaining benchmarks, which allows us to assess more accurately how powerful the tool is on truly higher-order systems. This gives the following results:
Technique | WANDA configuration | #Yes | #No | Perc. | Avg. time (s) |
WANDA without restrictions | ./wanda.exe | 67 | 5 | 82% | 0.45 |
Only rule removal | ./wanda.exe -d nt,dp | 39 | 0 | 44% | 0.27 |
Only rule removal with StarHorpo | ./wanda.exe -d poly,nt,dp | 29 | 0 | 33% | 0.12 |
Only rule removal with Polynomial Interpretations | ./wanda.exe -d horpo,nt,dp | 27 | 0 | 31% | 0.06 |
Only dependency pairs | ./wanda.exe -d rr,nt | 66 | 0 | 75% | 0.69 |
Only static dependency pairs | ./wanda.exe -d dynamic,rr,nt | 51 | 0 | 58% | 0.35 |
Only dynamic dependency pairs | ./wanda.exe -d static,rr,nt | 54 | 0 | 61% | 0.81 |
Full WANDA, but without a first-order tool | ./wanda.exe -i none | 67 | 4 | 81% | 0.37 |
Full WANDA, but without simplifying fake patterns | ./wanda.exe --dontsimplify | 64 | 5 | 78% | 0.63 |
(full result table for TPDB 2018, with two directories removed)
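For clarity, the columns in these tables are computed as follows: Perc. is the percentage of benchmarks answered YES or NO, and the filtered table simply omits every benchmark inside the two directories above. A sketch of that tally, assuming each result has been recorded as a (path, verdict, runtime) triple (the function and variable names are our own, not part of WANDA):

    EXCLUDED = ("Uncurried_Applicative_11", "Hamana_Kikuchi_18")

    def tally(results, exclude=False):
        # results: list of (path, verdict, runtime) triples from a full run
        kept = [r for r in results
                if not (exclude and any(d in r[0] for d in EXCLUDED))]
        yes = sum(1 for _, verdict, _ in kept if verdict == "YES")
        no = sum(1 for _, verdict, _ in kept if verdict == "NO")
        perc = round(100 * (yes + no) / len(kept))   # Perc.: fraction answered YES or NO
        avg = sum(t for _, _, t in kept) / len(kept) # assumed: averaged over all runs
        return yes, no, perc, avg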
As the remaining benchmarks do not have a significant first-order part, the first-order tool makes very little difference. It is also notable that StarHorpo and Polynomial Interpretations are now much closer in power, as are static and dynamic dependency pairs.
The third set of experiments is mentioned in the paper, although the tables were not included. This set tests WANDA's performance when coupled with different first-order termination back-ends. The first three tests couple WANDA with AProVE, NaTT and MU-TERM respectively, for both termination and non-termination; the fourth uses NaTT for termination and AProVE for non-termination; for comparison, both tables also repeat the results without any first-order back-end. The first table below considers all benchmarks in the TPDB; the second considers all except the most recently added directory, Hamana_Kikuchi_18/, since the difference is remarkable.
Setup | WANDA configuration | #Yes | #No | #Maybe | #Timeout | Perc. | Avg. time (s) |
AProVE for termination and non-termination | ./wanda.exe | 188 | 16 | 24 | 33 | 78% | 1.20 |
NaTT for termination and non-termination | ./wanda.exe -i nattprover -n nattprover | 201 | 10 | 35 | 15 | 81% | 1.87 |
MU-TERM for termination and non-termination | ./wanda.exe -i mutermprover -n mutermprover | 188 | 9 | 31 | 33 | 75% | 0.87 |
NaTT for termination, AProVE for non-termination if NaTT fails | ./wanda.exe -i nattprover | 200 | 17 | 24 | 20 | 83% | 1.88 |
No first-order tool | ./wanda.exe -i none | 183 | 9 | 46 | 23 | 74% | 0.89 |
(full result table for WANDA with various first-order tools on the TPDB 2018)
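The five setups differ only in the -i (first-order termination back-end) and -n (first-order non-termination back-end) flags. Listed as data, with the flag values taken verbatim from the configuration column above:

    # the five back-end setups from the tables, as (description, extra flags)
    SETUPS = [
        ("AProVE for both",                     []),
        ("NaTT for both",                       ["-i", "nattprover", "-n", "nattprover"]),
        ("MU-TERM for both",                    ["-i", "mutermprover", "-n", "mutermprover"]),
        ("NaTT, falling back to AProVE for NO", ["-i", "nattprover"]),
        ("no first-order tool",                 ["-i", "none"]),
    ]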
Setup | WANDA configuration | #Yes | #No | #Maybe | #Timeout | Perc. | Avg. time (s) |
AProVE for termination and non-termination | ./wanda.exe | 155 | 16 | 21 | 6 | 86% | 0.72 |
NaTT for termination and non-termination | ./wanda.exe -i nattprover -n nattprover | 154 | 9 | 32 | 3 | 82% | 0.37 |
MU-TERM for termination and non-termination | ./wanda.exe -i mutermprover -n mutermprover | 155 | 9 | 28 | 6 | 83% | 0.38 |
NaTT for termination, AProVE for non-termination if NaTT fails | ./wanda.exe -i nattprover | 153 | 16 | 21 | 8 | 85% | 0.41 |
No first-order tool | ./wanda.exe -i none | 150 | 9 | 36 | 3 | 80% | 0.41 |
(full result table for WANDA with various first-order tools on the TPDB 2017)
It is notable that NaTT's advantage is very localized: on the benchmarks already present in the 2017 TPDB, the three tools perform almost equally. It seems likely that later updates of MU-TERM and AProVE will also handle the benchmarks in Hamana_Kikuchi_18/. Otherwise, investigating the cases where a first-order tool is needed to conclude YES shows that all of these benchmarks have a simple higher-order part: in every case, a proof with static dependency pairs and the subterm criterion (or computable subterm criterion) suffices for the higher-order part. This may suggest that using a first-order tool for termination analysis is not worthwhile when the higher-order part has a different form. However, this might also be an artefact of the database: for most benchmarks, a first-order tool is not needed at all, and the few where it is needed here do not form a sufficiently diverse sample to confidently draw any conclusions.
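To illustrate what such a "simple higher-order part" looks like, consider the usual map rules (a generic example, not a specific TPDB benchmark):

    map(F, nil)         -> nil
    map(F, cons(h, t))  -> cons(F h, map(F, t))

The only static dependency pair is map#(F, cons(h, t)) => map#(F, t), and the subterm criterion handles it by projecting map# to its second argument: t is a strict subterm of cons(h, t). No ordering search, let alone a first-order tool, is needed for this part.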
The final set of experiments considers WANDA's power on the Confluence Problems database (COPS), 2019 version. As WANDA cannot read HRS input, the benchmarks were converted by CSI^ho, with some manual changes (these were necessary because, due to lacking documentation of WANDA's input restrictions, some of the generated AFSMs could not be read by WANDA). We used the same configurations as before, except for the "simplifying fake patterns" configuration, which only makes sense for AFS input.
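Concretely, these runs differ from the TPDB runs above only in the input files (the converted AFSMs) and the added --betafirst flag; a minimal sketch, with placeholder paths:

    import subprocess

    # full WANDA on one converted COPS benchmark, analogous to the TPDB runs above
    subprocess.run(["./wanda.exe", "--betafirst", "cops_converted/433.afsm"],
                   capture_output=True, text=True, timeout=60)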
Technique | WANDA configuration | #Yes | #No | Perc. | Avg. time (s) |
WANDA without restrictions | ./wanda.exe --betafirst | 43 | 30 | 78% | 0.09 |
Only rule removal | ./wanda.exe -d nt,dp --betafirst | 37 | 0 | 40% | 0.11 |
Only rule removal with StarHorpo | ./wanda.exe -d poly,nt,dp --betafirst | 33 | 0 | 35% | 0.17 |
Only rule removal with Polynomial Interpretations | ./wanda.exe -d horpo,nt,dp --betafirst | 21 | 0 | 23% | 0.01 |
Only dependency pairs | ./wanda.exe -d rr,nt --betafirst | 43 | 0 | 46% | 0.51 |
Only static dependency pairs | ./wanda.exe -d dynamic,rr,nt --betafirst | 37 | 0 | 40% | 0.26 |
Only dynamic dependency pairs | ./wanda.exe -d static,rr,nt --betafirst | 40 | 0 | 43% | 0.77 |
Full WANDA, but without a first-order tool | ./wanda.exe -i none --betafirst | 43 | 30 | 78% | 0.07 |
The longer average time when using dependency pairs is largely due to a single benchmark (433), for which rule removal is very effective, but which takes the non-monotonic reduction pair processor much longer to handle.