A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Serial Number | Dataset Description | Num Train Steps | steps_per_stats | Num_Layers | Num_Units | Dropout | Attention (With Type) | Embeddings (Description) | Done / Not Done | Best BLEU | Best Accuracy | Time Taken | F1 | F1 on another set: description | Score | Inference |
2 | Attention or no Attention | |||||||||||||||
3 | 1 | The test, train and dev sets were randomly shuffled and split in the ratio 10:80:10 | 40000 | 100 | 2 | 128 | 0.2 | No | No | Done | 97.69 | 89.75 | ~1 hour | | On the special test set | 0.7891 ||
4 | 2 | The test, train and dev sets were randomly shuffled and split in the ratio 10:80:10 | 40000 | 100 | 2 | 128 | 0.2 | Yes (scaled luong) | No | Done | 97.5 | 88 | ~1 hour 33 minutes ||||
5 | 3 | Special Set: Separate test + Same Vocab + Frequency thresholding. The test and train sets were exclusive of each other. | 30000 | 100 | 2 | 128 | 0.2 | No | No | Done | 66.39 | 5.71 | ~1 hour 30 minutes (empirical) |||
6 | 4 | Special Set: Separate test + Same Vocab + Frequency thresholding. The test and train sets were exclusive of each other. | 40000 | 100 | 2 | 128 | 0.2 | Yes (scaled luong) | No | Done | 85.16 | 34.29 | ~1 hour 30 minutes |||
7 | Model performance increased substantially with the attention mechanism, although the gains were not stable across runs. Overall, BLEU and accuracy reached good values, which points to the advantage of using attention in our further studies. All following experiments use attention. |||||||||||||||
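The training itself was done with an off-the-shelf NMT codebase, so the attention mechanism is not implemented here. As a minimal NumPy sketch (illustrative only, not the actual training code), Luong-style multiplicative attention scores each encoder state against the current decoder state, softmaxes the scores, and returns a weighted context vector; the `scale` argument stands in for the learned scalar of the "scaled luong" variant:

```python
import numpy as np

def luong_attention(decoder_state, encoder_states, scale=1.0):
    """Multiplicative (Luong-style) attention: dot-product scores,
    softmax over encoder time steps, weighted context vector.
    `scale` mimics the 'scaled luong' variant, which multiplies the
    scores by a learned scalar (a fixed constant here)."""
    scores = scale * (encoder_states @ decoder_state)   # shape (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                            # softmax over T
    context = weights @ encoder_states                  # shape (units,)
    return context, weights

# Toy example: 4 encoder time steps, 3 hidden units.
enc = np.array([[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0],
                [1.0, 1.0, 0.0]])
dec = np.array([1.0, 0.0, 0.0])
ctx, w = luong_attention(dec, enc)
```

Encoder states aligned with the decoder state (rows 0 and 3 here) receive the highest weights, which is the behaviour the BLEU gains above are attributed to.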
8 | Fix Dropout | |||||||||||||||
9 | 1 | Special Set: Separate test + Same Vocab + Frequency thresholding. The test and train sets were exclusive of each other. | 40000 | 100 | 2 | 128 | 0.05 | Yes (scaled luong) | No | Done | 58 | 0 | ~1 hour 40 minutes (My PC) |||
10 | 2 | Special Set: Separate test + Same Vocab + Frequency thresholding. The test and train sets were exclusive of each other. | 40000 | 100 | 2 | 128 | 0.5 | Yes (scaled luong) | No | Done | 86 | 40.9 | ~1.5 hours (My PC) | 0.5172 | Raising dropout from 0.2 to 0.5 improved the model's consistency and set a new best on the test set: 86 BLEU and 40.9% accuracy, with a GERBIL Macro F1 (QALD) of 0.5172. The only major remaining issue is that the model keeps confusing dbo_species with dbo_family. |
11 | 3 | Special Set: Separate test + Same Vocab + Frequency thresholding. The test and train sets were exclusive of each other. | 50000 | 100 | 2 | 128 | 0.7 | Yes (scaled luong) | No | Done | 85 | 45.06 | ~4 hours 15 minutes (College PC) |||
12 | 4 | Special Set: Separate test + Same Vocab + Frequency thresholding. The test and train sets were exclusive of each other. | 50000 | 100 | 2 | 128 | 0.9 | Yes (scaled luong) | No | Done | 59.6 | 2.3 | ~2 hours 18 minutes (GCP K80) |||
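The dropout sweep above (0.05 → 0.9, best at 0.5) uses the standard "inverted dropout" scheme, which most NMT codebases apply to the RNN layers. A small NumPy sketch of the idea (illustrative, not the training code): a fraction `rate` of activations is zeroed during training and the survivors are rescaled by 1/(1-rate), so expected activations match inference time with no extra rescaling:

```python
import numpy as np

def inverted_dropout(x, rate, rng):
    """Inverted dropout as used during training: zero out a fraction
    `rate` of activations and rescale survivors by 1/(1-rate) so the
    expectation is unchanged at inference time."""
    keep = 1.0 - rate
    mask = rng.random(x.shape) < keep
    return x * mask / keep

rng = np.random.default_rng(0)
x = np.ones(100000)
y = inverted_dropout(x, 0.5, rng)   # rate=0.5, the best value above
```

The sweep is consistent with the usual trade-off: too little dropout (0.05) overfits the small exclusive train set, too much (0.9) destroys the signal entirely.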
13 | Fix attention type | |||||||||||||||
14 | 1 | Special Set: Separate test + Same Vocab + Frequency thresholding. The test and train sets were exclusive of each other. | 50000 | 100 | 2 | 128 | 0.5 | Yes (luong) | No | Done | 76.7 | 9.1 | ~1 hour 40 minutes (empirical) |||
15 | 2 | Special Set: Separate test + Same Vocab + Frequency thresholding. The test and train sets were exclusive of each other. | 50000 | 100 | 2 | 128 | 0.5 | Yes (bahdanau) | No | Done | 62.5 | 0 | ~1 hour 40 minutes (empirical) |||
16 | 3 | Special Set: Separate test + Same Vocab + Frequency thresholding. The test and train sets were exclusive of each other. | 40000 | 100 | 2 | 128 | 0.5 | Yes (scaled luong) | No | Done | 86 | 40.9 | ~1.5 hours (My PC) | 0.5172 | Same run as the 0.5-dropout scaled-luong baseline above, repeated here for comparison. |
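The three attention variants compared above differ only in their scoring function. As a hedged sketch (toy dimensions and random parameters, not the actual trained weights): Luong's "general" score is multiplicative, `h_t^T W h_s` (the "scaled" variant multiplies it by a learned scalar), while Bahdanau's is additive, `v^T tanh(W1 h_t + W2 h_s)`:

```python
import numpy as np

def luong_score(h_t, h_s, W):
    # Luong 'general' (multiplicative) score: h_t^T W h_s.
    # The 'scaled luong' variant multiplies this by a learned scalar.
    return h_t @ W @ h_s

def bahdanau_score(h_t, h_s, W1, W2, v):
    # Bahdanau (additive) score: v^T tanh(W1 h_t + W2 h_s).
    return v @ np.tanh(W1 @ h_t + W2 @ h_s)

# Toy example with hidden size 4 and random parameters.
rng = np.random.default_rng(1)
d = 4
h_t, h_s = rng.normal(size=d), rng.normal(size=d)
W = rng.normal(size=(d, d))
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
v = rng.normal(size=d)
s_mul = luong_score(h_t, h_s, W)
s_add = bahdanau_score(h_t, h_s, W1, W2, v)
```

Both produce a scalar score per encoder state; the results above suggest the multiplicative family, and the scaled variant in particular, fits this task best.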
17 | Fix Number of Units | |||||||||||||||
18 | 1 | Special Set: Separate test + Same Vocab + Frequency thresholding. The test and train sets were exclusive of each other. | 50000 | 100 | 2 | 256 | 0.5 | Yes (scaled luong) | No | Done | 82.9 | 25 | ~10 hours 43 minutes (College PC) |||
19 | 2 | Special Set: Separate test + Same Vocab + Frequency thresholding. The test and train sets were exclusive of each other. | 50000 | 100 | 2 | 512 | 0.5 | Yes (scaled luong) | No | Done | 55.8 | 0 | ~2 hours (GCP) |||
20 | Fix Number of Layers | |||||||||||||||
21 | 1 | Special Set: Separate test + Same Vocab + Frequency thresholding. The test and train sets were exclusive of each other. | 4000 | 100 | 1 | 128 | 0.7 | Yes (scaled luong) | No | Done | 1 | 0 | ~15 minutes | | | The model did not learn beyond 400 iterations and terminated automatically with very high perplexity. |
22 | 2 | Special Set: Separate test + Same Vocab + Frequency thresholding. The test and train sets were exclusive of each other. | 50000 | 100 | 3 | 128 | 0.5 | Yes (scaled luong) | No | Done | 58.7 | 0 | 5 hours 13 minutes |||
23 | 3 | Special Set: Separate test + Same Vocab + Frequency thresholding. The test and train sets were exclusive of each other. | 40000 | 100 | 2 | 128 | 0.5 | Yes (scaled luong) | No | Done | 86 | 40.9 | ~1.5 hours (My PC) | 0.5172 | Same run as the 0.5-dropout scaled-luong baseline above, repeated here for comparison. |
24 | 4 | Special Set: Separate test + Same Vocab + Frequency thresholding. The test and train sets were exclusive of each other. | 50000 | 100 | 4 | 128 | 0.5 | Yes (scaled luong) | No | Done | 63 | 0 | ~1 hour 30 minutes |||
25 | Use Embeddings | |||||||||||||||
26 | 1 | Special Set: Separate test + Same Vocab + Frequency thresholding. The test and train sets were exclusive of each other. | 50000 | 100 | 2 | 128 | 0.7 | Yes (scaled luong) | Yes (SPARQL: biased graph walks; English: from previous models; OOV words randomly initialized) | Done | 93 | 63 | ~1 hour 30 minutes | | | Best results at 15,000 iterations. Interestingly, the train set had a hard time reaching good performance, and performance dipped after 40,000 iterations to 85 BLEU and 25% accuracy. |
27 | Using embeddings already trained previously with the same model mainly increased learning speed. Given that the embeddings are used for eukaryotes, the general fastText models do not have the relevant vocabulary, so the dataset needs to be changed for further evaluations. |||||||||||||||
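The embedding setup above (pretrained vectors where available, random initialisation for OOV words) can be sketched as follows. This is an illustrative assembly step with a toy pretrained table, not the actual pipeline code; in the real runs the pretrained vectors come from fastText or from biased graph walks over the knowledge graph:

```python
import numpy as np

def build_embedding_matrix(vocab, pretrained, dim, rng):
    """Assemble the embedding matrix for a vocabulary: words found in
    the pretrained table keep their vectors; out-of-vocabulary words
    get small random initialisations, as in the experiment above."""
    matrix = rng.normal(0.0, 0.1, size=(len(vocab), dim))
    for i, word in enumerate(vocab):
        if word in pretrained:
            matrix[i] = pretrained[word]
    return matrix

# Toy pretrained table (hypothetical vectors, dim=3).
rng = np.random.default_rng(0)
pretrained = {"species": np.ones(3), "family": np.full(3, 2.0)}
vocab = ["species", "family", "dbo_unknown"]
emb = build_embedding_matrix(vocab, pretrained, 3, rng)
```

This also makes the failure mode visible: if the domain vocabulary (here, eukaryote terms) is absent from the pretrained table, most rows fall back to random initialisation and the pretrained model contributes little.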
28 | 1 | Special Set: Separate test + Same Vocab + Frequency thresholding. The test and train sets were exclusive of each other. | 50000 | 100 | 2 | 128 | 0.7 | Yes (scaled luong) | Yes (fastText for English) | Not Done ||||||
29 | 2 | Special Set: Separate test + Same Vocab + Frequency thresholding. The test and train sets were exclusive of each other. | 50000 | 100 | 2 | 128 | 0.7 | Yes (scaled luong) | Yes (fastText for English; fastText for SPARQL) | Not Done ||||||
30 | 15 | 30000 | 100 | 2 | 256 | 0.5 | Yes | No |