A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Serial Number | Dataset Description | Num Train Steps | steps_per_stats | Num_Layers | Num_Units | Dropout | Attention (With Type) | Embeddings (Description) | Done / Not Done | Best BLEU | Best Accuracy | Time Taken | F1 | F1 on another set: description | Score | Inference |
2 | Attention or no Attention | |||||||||||||||
3 | 1 | The test, train and dev sets were randomly shuffled and split in the ratio 10:80:10 | 40000 | 100 | 2 | 128 | 0.2 | No | No | Done | 97.69 | 89.75 | ~1 hour | | On the special test set | 0.7891 ||
4 | 2 | The test, train and dev sets were randomly shuffled and split in the ratio 10:80:10 | 40000 | 100 | 2 | 128 | 0.2 | Yes (scaled luong) | No | Done | 97.5 | 88 | ~1 hour 33 minutes ||||
5 | 3 | Special Set: Separate test + Same Vocab + Frequency thresholding. The test and train sets were exclusive of each other. | 30000 | 100 | 2 | 128 | 0.2 | No | No | Done | 66.39 | 5.71 | ~1 hour 30 minutes (empirical) |||
6 | 4 | Special Set: Separate test + Same Vocab + Frequency thresholding. The test and train sets were exclusive of each other. | 40000 | 100 | 2 | 128 | 0.2 | Yes (scaled luong) | No | Done | 85.16 | 34.29 | ~1 hour 30 minutes |||
7 | Model performance increased substantially with the attention mechanism, although the gains were not stable across runs. Overall, BLEU and accuracy reached good values, which points to the advantage of using attention in our further studies. All following experiments use attention. |||||||||||||||
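The training itself was done with an off-the-shelf NMT codebase, so the attention mechanism is not implemented here. As a minimal NumPy sketch (illustrative only, not the actual training code), Luong-style multiplicative attention scores each encoder state against the current decoder state, softmaxes the scores, and returns a weighted context vector; the `scale` argument stands in for the learned scalar of the "scaled luong" variant:

```python
import numpy as np

def luong_attention(decoder_state, encoder_states, scale=1.0):
    """Multiplicative (Luong-style) attention: dot-product scores,
    softmax over encoder time steps, weighted context vector.
    `scale` mimics the 'scaled luong' variant, which multiplies the
    scores by a learned scalar (a fixed constant here)."""
    scores = scale * (encoder_states @ decoder_state)   # shape (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                            # softmax over T
    context = weights @ encoder_states                  # shape (units,)
    return context, weights

# Toy example: 4 encoder time steps, 3 hidden units.
enc = np.array([[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0],
                [1.0, 1.0, 0.0]])
dec = np.array([1.0, 0.0, 0.0])
ctx, w = luong_attention(dec, enc)
```

Encoder states aligned with the decoder state (rows 0 and 3 here) receive the highest weights, which is the behaviour the BLEU gains above are attributed to.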
8 | Fix Dropout | |||||||||||||||
9 | 1 | Special Set: Separate test + Same Vocab + Frequency thresholding. The test and train sets were exclusive of each other. | 40000 | 100 | 2 | 128 | 0.05 | Yes (scaled luong) | No | Done | 58 | 0 | ~1 hour 40 minutes (My PC) |||
10 | 2 | Special Set: Separate test + Same Vocab + Frequency thresholding. The test and train sets were exclusive of each other. | 40000 | 100 | 2 | 128 | 0.5 | Yes (scaled luong) | No | Done | 86 | 40.9 | ~1.5 hours (My PC) | 0.5172 | Raising dropout from 0.2 to 0.5 improved the model's consistency and set a new best on the test set: 86 BLEU and 40.9% accuracy, with a GERBIL Macro F1 (QALD) of 0.5172. The only major remaining issue is that the model keeps confusing dbo_species with dbo_family. |
11 | 3 | Special Set: Separate test + Same Vocab + Frequency thresholding. The test and train sets were exclusive of each other. | 50000 | 100 | 2 | 128 | 0.7 | Yes (scaled luong) | No | Done | 85 | 45.06 | ~4 hours 15 minutes (College PC) |||
12 | 4 | Special Set: Separate test + Same Vocab + Frequency thresholding. The test and train sets were exclusive of each other. | 50000 | 100 | 2 | 128 | 0.9 | Yes (scaled luong) | No | Done | 59.6 | 2.3 | ~2 hours 18 minutes (GCP K80) |||
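The dropout sweep above (0.05 → 0.9, best at 0.5) uses the standard "inverted dropout" scheme, which most NMT codebases apply to the RNN layers. A small NumPy sketch of the idea (illustrative, not the training code): a fraction `rate` of activations is zeroed during training and the survivors are rescaled by 1/(1-rate), so expected activations match inference time with no extra rescaling:

```python
import numpy as np

def inverted_dropout(x, rate, rng):
    """Inverted dropout as used during training: zero out a fraction
    `rate` of activations and rescale survivors by 1/(1-rate) so the
    expectation is unchanged at inference time."""
    keep = 1.0 - rate
    mask = rng.random(x.shape) < keep
    return x * mask / keep

rng = np.random.default_rng(0)
x = np.ones(100000)
y = inverted_dropout(x, 0.5, rng)   # rate=0.5, the best value above
```

The sweep is consistent with the usual trade-off: too little dropout (0.05) overfits the small exclusive train set, too much (0.9) destroys the signal entirely.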
13 | Fix attention type | |||||||||||||||
14 | 1 | Special Set: Separate test + Same Vocab + Frequency thresholding. The test and train sets were exclusive of each other. | 50000 | 100 | 2 | 128 | 0.5 | Yes (luong) | No | Done | 76.7 | 9.1 | ~1 hour 40 minutes (empirical) |||
15 | 2 | Special Set: Separate test + Same Vocab + Frequency thresholding. The test and train sets were exclusive of each other. | 50000 | 100 | 2 | 128 | 0.5 | Yes (bahdanau) | No | Done | 62.5 | 0 | ~1 hour 40 minutes (empirical) |||
16 | 3 | Special Set: Separate test + Same Vocab + Frequency thresholding. The test and train sets were exclusive of each other. | 40000 | 100 | 2 | 128 | 0.5 | Yes (scaled luong) | No | Done | 86 | 40.9 | ~1.5 hours (My PC) | 0.5172 | Same run as the 0.5-dropout scaled-luong baseline above, repeated here for comparison. |
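The three attention variants compared above differ only in their scoring function. As a hedged sketch (toy dimensions and random parameters, not the actual trained weights): Luong's "general" score is multiplicative, `h_t^T W h_s` (the "scaled" variant multiplies it by a learned scalar), while Bahdanau's is additive, `v^T tanh(W1 h_t + W2 h_s)`:

```python
import numpy as np

def luong_score(h_t, h_s, W):
    # Luong 'general' (multiplicative) score: h_t^T W h_s.
    # The 'scaled luong' variant multiplies this by a learned scalar.
    return h_t @ W @ h_s

def bahdanau_score(h_t, h_s, W1, W2, v):
    # Bahdanau (additive) score: v^T tanh(W1 h_t + W2 h_s).
    return v @ np.tanh(W1 @ h_t + W2 @ h_s)

# Toy example with hidden size 4 and random parameters.
rng = np.random.default_rng(1)
d = 4
h_t, h_s = rng.normal(size=d), rng.normal(size=d)
W = rng.normal(size=(d, d))
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
v = rng.normal(size=d)
s_mul = luong_score(h_t, h_s, W)
s_add = bahdanau_score(h_t, h_s, W1, W2, v)
```

Both produce a scalar score per encoder state; the results above suggest the multiplicative family, and the scaled variant in particular, fits this task best.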
17 | Fix Number of Units | |||||||||||||||
18 | 1 | Special Set: Separate test + Same Vocab + Frequency thresholding. The test and train sets were exclusive of each other. | 50000 | 100 | 2 | 256 | 0.5 | Yes (scaled luong) | No | Done | 82.9 | 25 | ~10 hours 43 minutes (College PC) |||
19 | 2 | Special Set: Separate test + Same Vocab + Frequency thresholding. The test and train sets were exclusive of each other. | 50000 | 100 | 2 | 512 | 0.5 | Yes (scaled luong) | No | Done | 55.8 | 0 | ~2 hours (GCP) |||
20 | Fix Number of Layers | |||||||||||||||
21 | 1 | Special Set: Separate test + Same Vocab + Frequency thresholding. The test and train sets were exclusive of each other. | 4000 | 100 | 1 | 128 | 0.7 | Yes (scaled luong) | No | Done | 1 | 0 | ~15 minutes | | | The model did not learn beyond 400 iterations and terminated automatically with very high perplexity. |
22 | 2 | Special Set: Separate test + Same Vocab + Frequency thresholding. The test and train sets were exclusive of each other. | 50000 | 100 | 3 | 128 | 0.5 | Yes (scaled luong) | No | Done | 58.7 | 0 | 5 hours 13 minutes |||
23 | 3 | Special Set: Separate test + Same Vocab + Frequency thresholding. The test and train sets were exclusive of each other. | 40000 | 100 | 2 | 128 | 0.5 | Yes (scaled luong) | No | Done | 86 | 40.9 | ~1.5 hours (My PC) | 0.5172 | Same run as the 0.5-dropout scaled-luong baseline above, repeated here for comparison. |
24 | 4 | Special Set: Separate test + Same Vocab + Frequency thresholding. The test and train sets were exclusive of each other. | 50000 | 100 | 4 | 128 | 0.5 | Yes (scaled luong) | No | Done | 63 | 0 | ~1 hour 30 minutes |||
25 | Use Embeddings | |||||||||||||||
26 | 1 | Special Set: Separate test + Same Vocab + Frequency thresholding. The test and train sets were exclusive of each other. | 50000 | 100 | 2 | 128 | 0.7 | Yes (scaled luong) | Yes (SPARQL: biased graph walks; English: from previous models; OOV words randomly initialized) | Done | 93 | 63 | ~1 hour 30 minutes | | | Best results at 15,000 iterations. Interestingly, the train set had a hard time reaching good performance, and performance dipped after 40,000 iterations to 85 BLEU and 25% accuracy. |
27 | Using embeddings already trained previously with the same model mainly increased learning speed. Given that the embeddings are used for eukaryotes, the general fastText models do not have the relevant vocabulary, so the dataset needs to be changed for further evaluations. |||||||||||||||
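The embedding setup above (pretrained vectors where available, random initialisation for OOV words) can be sketched as follows. This is an illustrative assembly step with a toy pretrained table, not the actual pipeline code; in the real runs the pretrained vectors come from fastText or from biased graph walks over the knowledge graph:

```python
import numpy as np

def build_embedding_matrix(vocab, pretrained, dim, rng):
    """Assemble the embedding matrix for a vocabulary: words found in
    the pretrained table keep their vectors; out-of-vocabulary words
    get small random initialisations, as in the experiment above."""
    matrix = rng.normal(0.0, 0.1, size=(len(vocab), dim))
    for i, word in enumerate(vocab):
        if word in pretrained:
            matrix[i] = pretrained[word]
    return matrix

# Toy pretrained table (hypothetical vectors, dim=3).
rng = np.random.default_rng(0)
pretrained = {"species": np.ones(3), "family": np.full(3, 2.0)}
vocab = ["species", "family", "dbo_unknown"]
emb = build_embedding_matrix(vocab, pretrained, 3, rng)
```

This also makes the failure mode visible: if the domain vocabulary (here, eukaryote terms) is absent from the pretrained table, most rows fall back to random initialisation and the pretrained model contributes little.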
28 | 1 | Special Set: Separate test + Same Vocab + Frequency thresholding. The test and train sets were exclusive of each other. | 50000 | 100 | 2 | 128 | 0.7 | Yes (scaled luong) | Yes (fastText for English) | Not Done ||||||
29 | 2 | Special Set: Separate test + Same Vocab + Frequency thresholding. The test and train sets were exclusive of each other. | 50000 | 100 | 2 | 128 | 0.7 | Yes (scaled luong) | Yes (fastText for English; fastText for SPARQL) | Not Done ||||||
30 | 15 | 30000 | 100 | 2 | 256 | 0.5 | Yes | No |