![rw-book-cover](https://readwise-assets.s3.amazonaws.com/media/reader/parsed_document_assets/323205078/65PmwcbEuyF-hVshDa84Ct7wOtlDdqFb3JdXclysGx0-cove_6cvMXKN.png)

---

> The dataset consists of maps representing data in two primary forms: discrete, where the legend is divided into distinct groups, and continuous, where the legend is distributed over a spectrum. The maps also include variations in the presence or absence of annotations, which provide additional contextual information. Our dataset also includes maps with black-and-white textured patterns or hatches for discrete data, different colormap variations (light, dark, and gradient scales), and varying paper background colors (white and grey). These variations test the models’ capability to handle diverse visual presentations.

- [View Highlight](https://read.readwise.io/read/01jxwf1p1e3pycbdjxgt6smawx)

---

> Each question could have answers in one of the following formats: Binary (Yes/No), Single Word, Count, List, Range, and Ranking. Examples of these are shown in Table 1. All questions were manually created by expert annotators, with the help of provided templates, with 10 questions created for each map. Overall, we created 1000 question-answer pairs for each country.

- [View Highlight](https://read.readwise.io/read/01jxwf49267xwehy3zxg36nw55)

---

> We evaluated the baseline models under two distinct prompting settings:
> 1. Zero-Shot Chain-of-Thought Prompting (COT). We leverage Chain-of-Thought (Wei et al., 2023) prompting, presenting the VLM with a map and a question, prompting it to reason through the steps leading to its final answer.
> 2. Explicit Extraction and Reasoning (EER). Here, we created a custom prompt that explicitly outlined the reasoning steps the model should follow to answer the specific question. This prompt was broken down into four distinct reasoning steps.

- [View Highlight](https://read.readwise.io/read/01jxwf5dgws3cechp18n0dh9ea)

---

> For binary yes/no and integer count answers, we implemented an exact match criterion and accuracy as the evaluation metric. For single-word answers, as some questions have multiple applicable responses, we employed the recall metric for better evaluation. For state names, a valid answer could be either a two-digit state code or the full state name. For ranges, we first normalized the ranges to absolute values (e.g., 1k to 1000) and then compared them. For discrete maps, only an exact match was expected, whereas for continuous maps, we gave a full score of 1 for an exact match and a partial score of 0.5 for overlapping responses.
>
> For list-type answers, we used precision and recall metrics because predicted lists often contained irrelevant states (false positives) and missed relevant states (false negatives).
>
> For rank-type answers, we prompted the model to assign ranks to states based on map values. However, due to the difficulty in accurately distinguishing shades, models frequently assigned states to the wrong shades, resulting in multiple states sharing the same rank despite differing shades.

- [View Highlight](https://read.readwise.io/read/01jxwf7t15zrae3j3j776srrnt)

---

> Smaller models, like QwenVL, CogAgent, and InternLM, faced challenges in producing answers in the desired format. To address this, we used an "LLM as an Extractor" approach, using Gemini 1.5 Flash to extract answers from their outputs.

- [View Highlight](https://read.readwise.io/read/01jxwf8fjprv65r22d4mkhwcg7)

---
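As a rough illustration of the answer-format taxonomy described in the highlights above (Binary, Single Word, Count, List, Range, Ranking), a single question-answer record might be represented as follows. The field names and example values are assumptions made for illustration only, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import Literal, Union

AnswerType = Literal["binary", "single_word", "count", "list", "range", "ranking"]

@dataclass
class MapQA:
    """One question-answer pair attached to a choropleth map image (illustrative schema)."""
    map_id: str                                   # which map image the question refers to
    legend_type: Literal["discrete", "continuous"]
    question: str
    answer_type: AnswerType
    answer: Union[str, int, list[str]]

# Example records, one per answer format family (values are placeholders):
examples = [
    MapQA("map_0001", "discrete", "Does State A have a higher value than State B?", "binary", "Yes"),
    MapQA("map_0002", "continuous", "How many states fall in the highest bracket?", "count", 4),
    MapQA("map_0003", "discrete", "List the states shown with a hatched pattern.", "list", ["State A", "State C"]),
]
```

---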
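The two prompting settings could be sketched as templates along these lines. The wording, and in particular the four EER steps, are illustrative placeholders; the paper's exact prompts are not quoted in the highlights.

```python
# Zero-shot Chain-of-Thought: the VLM sees the map image plus the question
# and is asked to reason step by step before giving a final answer.
COT_PROMPT = (
    "You are given a choropleth map image.\n"
    "Question: {question}\n"
    "Think through the problem step by step, then state your final answer."
)

# Explicit Extraction and Reasoning: the prompt spells out intermediate
# reasoning steps. The four steps below are assumed for illustration and
# are not the paper's exact wording.
EER_PROMPT = (
    "You are given a choropleth map image.\n"
    "Question: {question}\n"
    "Step 1: Read the legend and note what each colour or pattern represents.\n"
    "Step 2: Identify the regions relevant to the question.\n"
    "Step 3: Map each relevant region's colour or pattern to its legend value.\n"
    "Step 4: Combine these values to answer the question.\n"
    "Give your final answer in the requested format."
)
```

---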
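A minimal sketch of the range and list scoring rules described above: shorthand values such as "1k" are normalized to absolute numbers, an exact range match scores 1, overlapping ranges on continuous maps score 0.5, and list answers are scored with precision and recall. Function names and parsing details are assumptions, not the paper's code.

```python
import re

def normalize_value(token: str) -> float:
    """Turn strings such as '1k', '2.5m', or '1,000' into absolute numbers."""
    multipliers = {"k": 1e3, "m": 1e6, "b": 1e9}
    token = token.strip().lower().replace(",", "")
    if token and token[-1] in multipliers:
        return float(token[:-1]) * multipliers[token[-1]]
    return float(token)

def score_range(pred: str, gold: str, map_type: str) -> float:
    """Score a range answer such as '1k-5k' against the gold range.

    Discrete maps: exact match only. Continuous maps: 1.0 for an exact
    match, 0.5 if the predicted and gold ranges merely overlap.
    """
    p_lo, p_hi = sorted(normalize_value(x) for x in re.split(r"\s*(?:-|to)\s*", pred))
    g_lo, g_hi = sorted(normalize_value(x) for x in re.split(r"\s*(?:-|to)\s*", gold))
    if (p_lo, p_hi) == (g_lo, g_hi):
        return 1.0
    if map_type == "continuous" and p_lo <= g_hi and g_lo <= p_hi:
        return 0.5
    return 0.0

def score_list(pred: list[str], gold: list[str]) -> tuple[float, float]:
    """Precision / recall for list-type answers (e.g. lists of state names)."""
    pred_set = {x.strip().lower() for x in pred}
    gold_set = {x.strip().lower() for x in gold}
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall
```

Under these rules, `score_range("1k-5k", "1000 to 5000", "continuous")` returns 1.0, while `score_range("1k-3k", "2k-5k", "continuous")` returns 0.5 because the two ranges only overlap.

---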
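The "LLM as an Extractor" step could look roughly like this, calling Gemini 1.5 Flash through the google-generativeai client to pull a clean, comparable answer out of a smaller VLM's verbose output. The extraction prompt below is an assumed stand-in, not the prompt the authors used.

```python
# pip install google-generativeai
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # assumption: key supplied via env/config
extractor = genai.GenerativeModel("gemini-1.5-flash")

# Illustrative extraction prompt (not the paper's wording).
EXTRACTION_PROMPT = """You are given a question about a choropleth map and a
model's free-form answer. Extract only the final answer in the requested
format (Yes/No, single word, count, list, range, or ranking). Return nothing else.

Question: {question}
Expected answer format: {answer_format}
Model output: {raw_output}
Final answer:"""

def extract_answer(question: str, answer_format: str, raw_output: str) -> str:
    """Ask Gemini 1.5 Flash to normalize a smaller VLM's output into the expected format."""
    prompt = EXTRACTION_PROMPT.format(
        question=question, answer_format=answer_format, raw_output=raw_output
    )
    response = extractor.generate_content(prompt)
    return response.text.strip()
```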