Squadgym

Environment that can be used to evaluate reasoning capabilities of artificial agents




The Stanford Question Answering Dataset (SQuAD) has recently gained a lot of attention from practitioners and researchers due to its appealing properties for evaluating agents that answer open-domain questions. In this dataset, given a reference context and a question, the agent should generate an answer that may consist of multiple tokens present in the given context. Due to its high quality, it is a relevant benchmark for intelligent agents that must grasp, from a given context, the evidence required to generate the answer.

The SQuAD dataset contains questions drawn from Wikipedia and related to specific entities. If the agent is able to extract from the associated context the sequence of tokens that composes the answer, we may legitimately state that the system demonstrates sound reasoning capabilities. The system should generate an answer without exploiting supplementary features associated with the question or the context; it should instead "read" the correct answer from the context text.


SQuAD-Gym is a language game in which the agent receives multiple context-question pairs taken from the SQuAD dataset and, for each of them, should generate an answer composed of multiple tokens. Based on the generated answer, the agent receives a question score, which is added to the other scores obtained during the game. At the end of the game, a cumulative score is produced; this is the score the agent should learn to maximize in the long run. It is worth noting that the agent is not required to pick from fixed responses: it should generate a response by composing tokens together.
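The game loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: `play_game`, `exact_match_score` and `first_two_tokens_agent` are hypothetical names that do not come from the SQuAD-Gym code, and an exact-match scorer stands in for the real per-question score.

```python
from typing import Callable, List, Tuple

# Hypothetical data layout: (context, question, reference answers).
Pair = Tuple[str, str, List[str]]


def exact_match_score(answer_tokens: List[str], references: List[str]) -> float:
    """Placeholder per-question score: 1.0 if the answer equals any reference."""
    answer = " ".join(answer_tokens)
    return 1.0 if answer in references else 0.0


def play_game(
    pairs: List[Pair],
    agent: Callable[[str, str], List[str]],
    score_fn: Callable[[List[str], List[str]], float] = exact_match_score,
) -> float:
    """Run one game: ask every question and sum the per-question scores."""
    cumulative = 0.0
    for context, question, references in pairs:
        answer_tokens = agent(context, question)  # the agent composes tokens
        cumulative += score_fn(answer_tokens, references)
    return cumulative


def first_two_tokens_agent(context: str, question: str) -> List[str]:
    """Toy agent that always answers with the first two context tokens."""
    return context.split()[:2]


pairs = [
    ("Carolina Panthers lost the game .", "Who lost ?", ["Carolina Panthers"]),
    ("The game was played in 2016 .", "When ?", ["2016"]),
]
total = play_game(pairs, first_two_tokens_agent)
```

Passing the scorer as a parameter keeps the loop independent of how each question is scored, so a different per-question metric can be swapped in without touching the loop.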

For instance, suppose that we are in a given match of the game and the agent receives the following context-question pair:


Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.


Which NFL team represented the NFC at Super Bowl 50?

We expect the system, by reading the text in the given context, to generate "Carolina Panthers", which is the correct answer to the question according to the context text. The system then receives a score computed as the sentence-level BLEU score between the generated sequence and the possible target sequences present in the SQuAD dataset.
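To make the scoring step concrete, here is a minimal pure-Python sketch of a sentence-level BLEU score against several reference answers. The README does not say which BLEU variant, tokenization or smoothing the environment uses, so everything here is an assumption; in particular, the n-gram order is capped at the answer length so that short answers such as "Carolina Panthers" are not scored as zero.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate: List[str], references: List[List[str]],
                  max_n: int = 4) -> float:
    """Sentence-level BLEU with clipped n-gram precision and brevity penalty."""
    if not candidate or not references:
        return 0.0
    # Assumption: cap the n-gram order at the candidate length.
    order = min(max_n, len(candidate))
    log_precision_sum = 0.0
    for n in range(1, order + 1):
        cand_counts = ngrams(candidate, n)
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref: Counter = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        if clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_precision_sum += math.log(clipped / sum(cand_counts.values()))
    # Brevity penalty against the reference closest in length to the candidate.
    c_len = len(candidate)
    r_len = min((len(ref) for ref in references),
                key=lambda length: (abs(length - c_len), length))
    penalty = 1.0 if c_len >= r_len else math.exp(1.0 - r_len / c_len)
    return penalty * math.exp(log_precision_sum / order)


refs = [["Carolina", "Panthers"], ["the", "Carolina", "Panthers"]]
perfect = sentence_bleu(["Carolina", "Panthers"], refs)
wrong = sentence_bleu(["Denver", "Broncos"], refs)
```

An answer that matches a reference exactly scores 1.0, an answer sharing no tokens with any reference scores 0.0, and an answer shorter than its closest reference is discounted by the brevity penalty.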


Before you can use the environment, you need to download the SQuAD dataset in JSON format from the official website and run the build_env_data script as follows:

python3 build_env_data.py squad_data.json env_data.pkl

The script generates a pickle file containing all the data required by the environment. After that, you can install the package using Python setuptools if you want to use it in your project, or you can try it by running the env_test.py script, passing the pickle file generated by the build_env_data script:

python3 env_test.py env_data.pkl
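For readers curious about what the data preparation step involves, the sketch below flattens the nested SQuAD JSON layout (`data`, `paragraphs`, `qas`, `answers`) into a list of examples and pickles it. It is not the actual build_env_data.py: the function names and the structure of the pickled records are assumptions made for illustration.

```python
import json
import pickle
from typing import Any, Dict, List


def flatten_squad(squad: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn the nested SQuAD JSON into one record per question."""
    examples = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                examples.append({
                    "context": context,
                    "question": qa["question"],
                    # A question may come with several reference answers.
                    "answers": [answer["text"] for answer in qa["answers"]],
                })
    return examples


def build_env_data_sketch(json_path: str, pickle_path: str) -> int:
    """Hypothetical stand-in for the real script: JSON in, pickle out."""
    with open(json_path, encoding="utf-8") as f:
        squad = json.load(f)
    examples = flatten_squad(squad)
    with open(pickle_path, "wb") as f:
        pickle.dump(examples, f)
    return len(examples)


tiny = {"data": [{"title": "Super_Bowl_50", "paragraphs": [{
    "context": "The Carolina Panthers represented the NFC.",
    "qas": [{"id": "q1",
             "question": "Which NFL team represented the NFC?",
             "answers": [{"text": "Carolina Panthers", "answer_start": 4}]}],
}]}]}
records = flatten_squad(tiny)
```

Keeping every reference answer for a question matters here, because the score is computed against all possible target sequences rather than a single one.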

Future work

This project is a playground for artificial agents through which it may be possible to answer the following question:

Is it possible to develop artificial agents able to answer open-domain questions which require different capabilities (e.g. ability to see, ability to hear, etc.)?

At the moment, the most important aspects still to be designed are:

  1. Extend the game in order to provide different scores to the agent throughout the game and not just the BLEU score between the generated answer and the target one;
  2. Design a possible curriculum learning strategy according to which the agent receives questions of increasing complexity while it plays the game;
  3. Design a multi-modal game: the agent should be able to answer questions about textual data, images, songs or videos (like in the Italian game Rischiatutto).
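As a purely illustrative starting point for the curriculum idea in item 2, questions could be ordered by a difficulty proxy and served in stages. The proxy used here (context length in tokens) and the names `difficulty` and `curriculum_stages` are assumptions; the project does not define how question complexity should be measured.

```python
from typing import Dict, List


def difficulty(example: Dict[str, str]) -> int:
    """Assumed proxy: a longer context means a harder question."""
    return len(example["context"].split())


def curriculum_stages(examples: List[Dict[str, str]],
                      num_stages: int) -> List[List[Dict[str, str]]]:
    """Sort examples from easy to hard and split them into ordered stages."""
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    if not examples:
        return []
    ordered = sorted(examples, key=difficulty)
    size = -(-len(ordered) // num_stages)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


examples = [
    {"context": "a b c d e", "question": "q1"},
    {"context": "a", "question": "q2"},
    {"context": "a b c", "question": "q3"},
]
stages = curriculum_stages(examples, num_stages=2)
```

Any other difficulty measure can be substituted by replacing the `difficulty` function, since the staging logic only relies on the sort order.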

The project is a work in progress. However, sharing it with the community can be really useful for obtaining valuable feedback to improve it. Once the environment is complete, it would be interesting to evaluate different kinds of QA models, just like in the official SQuAD challenge.


The project is in its early stages, so all contributions are very welcome. Feel free to open an issue if you find something wrong, or create a pull request if you want to extend SQuAD-Gym.


Alessandro Suglia -- my name dot my surname at yahoo dot com

Open Source Agenda is not affiliated with "Squadgym" Project. README Source: aleSuglia/squadgym