Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception
https://github.com/X-PLUG/MobileAgent/assets/127390760/26c48fb0-67ed-4df6-97b2-aa0c18386d31
The demo can now be experienced at Hugging Face and ModelScope.
git clone https://github.com/X-PLUG/MobileAgent.git
cd MobileAgent
pip install -r requirements.txt
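Test your ADB environment:
/path/to/adb devices
If the connected devices are displayed, the preparation is complete.
On macOS or Linux, make sure adb is executable:
sudo chmod +x /path/to/adb
On Windows, the adb path has the form xx/xx/adb.exe.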
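
The device check above can also be scripted. The following is a minimal sketch, not part of Mobile-Agent: the helper names `parse_adb_devices` and `connected_devices` are hypothetical, and the code assumes the usual `adb devices` output, in which each ready device appears as a serial followed by the state `device`.

```python
import subprocess


def parse_adb_devices(output: str) -> list[str]:
    """Return serials of devices reported in the 'device' (ready) state."""
    serials = []
    for line in output.splitlines():
        parts = line.split()
        # Skip the "List of devices attached" header, blank lines, and
        # entries in other states such as "unauthorized" or "offline".
        if len(parts) >= 2 and parts[1] == "device":
            serials.append(parts[0])
    return serials


def connected_devices(adb_path: str) -> list[str]:
    """Run `<adb_path> devices` and return the serials that are ready."""
    result = subprocess.run(
        [adb_path, "devices"], capture_output=True, text=True, check=True
    )
    return parse_adb_devices(result.stdout)
```

Calling `connected_devices("/path/to/adb")` with your own adb path should return a non-empty list once the phone is connected and ADB debugging is authorized.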
❗Because GPT-4V hallucinates severely when perceiving non-English screenshots, we strongly recommend using Mobile-Agent only with English-language systems and apps.
❗Because our resources are currently limited, please contact us to get a free API key, which consists of a URL and a token.
python run_api.py --adb_path /path/to/adb --url "The url you got" --token "The token you got" --instruction "your instruction"
python run.py --grounding_ckpt /path/to/GroundingDINO --adb_path /path/to/adb --api "your API_TOKEN" --instruction "your instruction"
API_TOKEN is an API key from OpenAI with permission to access gpt-4-vision-preview.
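
When an instruction contains quotes or other shell-sensitive characters, it can be easier to assemble the `run_api.py` command as an argument list than as a single string. The sketch below is an illustration only: `build_run_api_command` is a hypothetical helper, and it uses just the four flags shown in the command above.

```python
import shlex


def build_run_api_command(adb_path, url, token, instruction):
    """Build the argument list for the run_api.py command shown above."""
    return [
        "python", "run_api.py",
        "--adb_path", adb_path,
        "--url", url,
        "--token", token,
        "--instruction", instruction,
    ]


cmd = build_run_api_command(
    "/path/to/adb",
    "The url you got",
    "The token you got",
    "Navigate to a nearby gas station.",
)
# shlex.join quotes arguments that contain spaces, so the printed line
# can be pasted into a POSIX shell.
print(shlex.join(cmd))
```

The list can be passed directly to `subprocess.run(cmd, check=True)` from the repository root, which avoids shell quoting altogether.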
Mobile-Eval is a benchmark designed for evaluating the performance of mobile device agents. This benchmark includes 10 mainstream single-app scenarios and 1 multi-app scenario.
For each scenario, we designed three instructions, numbered 1 to 3 in the table below.
The detailed content of Mobile-Eval is as follows:
Application | Instruction |
---|---|
Alibaba.com | 1. Help me find caps in Alibaba.com. 2. Help me find caps in Alibaba.com. If the "Add to cart" is available in the item information page, please add the item to my cart. 3. I want to buy a cap. I've heard things are cheap on Alibaba.com. Maybe you can find it for me. |
Amazon Music | 1. Search singer Jay Chou in Amazon Music. 2. Search a music about "agent" in Amazon Music and play it. 3. I want to listen music to relax. Find an App to help me. |
Chrome | 1. Search result for today's Lakers game. 2. Search the information about Taylor Swift. 3. I want to know the result for today's Lakers game. Find an App to help me. |
Gmail | 1. Send an empty email to {address}. 2. Send an email to {address} to tell my new work. 3. I want to let my friend know my new work, and his address is {address}. Find an App to help me. |
Google Maps | 1. Navigate to Hangzhou West Lake. 2. Navigate to a nearby gas station. 3. I want to go to Hangzhou West Lake, but I don't know the way. Find an App to help me. |
Google Play | 1. Download WhatsApp in Play Store. 2. Download Instagram in Play Store. 3. I want WhatsApp on my phone. Find an App to help me. |
Notes | 1. Create a new note in Notes. 2. Create a new note in Notes and write "Hello, this is a note", then save it. 3. I suddenly have something to record, so help me find an App and write down the following content: meeting at 3pm. |
Settings | 1. Turn on the dark mode. 2. Turn on the airplane mode. 3. I want to see the real time internet speed at the battery level, please turn on this setting for me. |
TikTok | 1. Swipe a video about pet cat in TikTok and click a "like" for this video. 2. Swipe a video about pet cat in TikTok and comment "Ohhhh, so cute cat!". 3. Swipe videos in TikTok. Click "like" for 3 pet video cat. |
YouTube | 1. Search for videos about Stephen Curry on YouTube. 2. Search for videos about Stephen Curry on YouTube and open "Comments" to comment "Oh, chef, your basketball spirit has always inspired me". 3. I need you to help me show my love for Stephen Curry on YouTube. |
Multi-App | 1. Open the calendar and look at today's date, then go to Notes and create a new note to write "Today is {today's date}". 2. Check the temperature in the next 5 days, and then create a new note in Notes and write a temperature analysis. 3. Search the result for today's Lakers game, and then create a note in Notes to write a sport news for this result. |
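
For scripting over the benchmark, the table can be held as a mapping from application to its three instructions. This is a minimal sketch under stated assumptions: the names `MOBILE_EVAL` and `iter_tasks` are hypothetical, and only two rows are copied from the table; the remaining rows would follow the same shape.

```python
# Two scenarios copied from the table above; the other rows have the same shape.
MOBILE_EVAL = {
    "Google Maps": [
        "Navigate to Hangzhou West Lake.",
        "Navigate to a nearby gas station.",
        "I want to go to Hangzhou West Lake, but I don't know the way. "
        "Find an App to help me.",
    ],
    "Notes": [
        "Create a new note in Notes.",
        'Create a new note in Notes and write "Hello, this is a note", '
        "then save it.",
        "I suddenly have something to record, so help me find an App and "
        "write down the following content: meeting at 3pm.",
    ],
}


def iter_tasks(benchmark):
    """Yield (application, instruction_number, instruction), numbered from 1."""
    for app, instructions in benchmark.items():
        for number, text in enumerate(instructions, start=1):
            yield app, number, text


for app, number, text in iter_tasks(MOBILE_EVAL):
    print(f"{app} #{number}: {text}")
```

Numbering from 1 keeps the instruction numbers aligned with the numbering used in the table.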
We evaluated Mobile-Agent on Mobile-Eval. The evaluation results are available at LINK.
For example, the results for the second Google Maps instruction are stored under results/Google Maps/2.
If you find Mobile-Agent useful for your research and applications, please cite using this BibTeX:
@article{wang2024mobile,
title={Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception},
author={Wang, Junyang and Xu, Haiyang and Ye, Jiabo and Yan, Ming and Shen, Weizhou and Zhang, Ji and Huang, Fei and Sang, Jitao},
journal={arXiv preprint arXiv:2401.16158},
year={2024}
}