Bulb Project Vision

In this era of disruptive AI technology, a large number of influential and representative large-scale models have emerged both domestically and internationally, including but not limited to:

ChatGPT by OpenAI
ChatGLM by Zhipu AI
Mianbi Model by Model Best Intelligence
ERNIE Bot by Baidu
Tongyi Qianwen by Alibaba
Spark Model by iFlytek

We hold conversations with these future AGIs to understand how they think and to break down the cognitive barriers between humans and AGIs. The project is named "Bulb" because it aims to shed light on the boundaries of large-model capabilities, giving us a more specific and quantitative understanding of them. For more detailed information, visit our official website: https://bulb.atomecho.cn

In the future, we will continue to introduce new members to the family of large-scale models. To explore the capabilities of each model more fairly, we have given each model a nickname, including: Radish, Eggplant, Carrot, Chili, Pumpkin, Potato, etc.

How to Evaluate the Abilities of Large Models:

Because large models generalize strongly across tasks and knowledge domains, evaluating them objectively and accurately is a significant challenge. To this end, we have collected a test set of more than 1,000 questions (continuously updated) to evaluate the performance of large models. The questions cover multiple dimensions of a model's capabilities and are intended to reflect its general ability.

Format of Evaluation Questions:

The data is contained in question.json. Each record has the following fields:

"qid": Question ID
"question": Question description
"first_level_class": First-level classification
"second_level_class": Second-level classification

For example:

[
    {
        "qid": 2,
        "question": "What is machine learning? What are its applications?",
        "first_level_class": "General Knowledge",
        "second_level_class": "Natural Sciences"
    },
	...
]
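
As a minimal sketch of how records in this format could be read, the snippet below parses the JSON text and indexes the questions by "qid". The helper name load_questions and the field check are illustrative assumptions, not part of the Bulb code base.

```python
import json

# Field names taken from the format description above.
REQUIRED_FIELDS = ("qid", "question", "first_level_class", "second_level_class")


def load_questions(text):
    """Parse the JSON text of question.json and index the records by qid.

    Hypothetical helper: raises ValueError if a record lacks a documented field.
    """
    by_qid = {}
    for record in json.loads(text):
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            raise ValueError(f"question {record.get('qid')} lacks fields: {missing}")
        by_qid[record["qid"]] = record
    return by_qid


# The sample record shown above, used here instead of reading the file from disk.
sample = """[
    {
        "qid": 2,
        "question": "What is machine learning? What are its applications?",
        "first_level_class": "General Knowledge",
        "second_level_class": "Natural Sciences"
    }
]"""

questions = load_questions(sample)
print(questions[2]["first_level_class"])  # General Knowledge
```

To read the real file, the same function could be given the contents of question.json, for example `load_questions(open("question.json", encoding="utf-8").read())`.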

Format of Answers:

The data is contained in answer-.json under each model directory, where the "date" field records when the answers were obtained, so that they can be matched to a specific model version. We will further expand the user feedback data in the future. The fields in answer-.json are as follows:

"qid": Question ID
"answer": Model's answer
"score": Score assigned to the answer

[
    {
        "qid": 0,
        "answer": "Machine learning refers to a method of using computer programs to train data, enabling computers to possess intelligence and adaptability.\n\nMachine learning can be widely applied in various fields, including but not limited to the following:\n\n1. Natural Language Processing: Examples include machine translation, text classification, sentiment analysis, etc.\n\n2. Finance: Examples include credit risk assessment, portfolio optimization, fraud detection, etc.\n\n3. Healthcare: Examples include disease prediction, medical image analysis, drug discovery, etc.\n\n4. Intelligent Transportation: Examples include traffic prediction, autonomous driving, smart public transportation, etc.\n\n5. E-commerce: Examples include recommendation systems, targeted advertising, user profiling, etc.\n\n6. Industrial Manufacturing: Examples include quality prediction, production scheduling optimization, supply chain management, etc.\n\n7. Security Monitoring: Examples include video surveillance, intrusion detection, network security, etc.\n\n8. Gaming: Examples include intelligent game NPCs, game difficulty balancing, dynamic story generation, etc.\n",
        "score": 8
    },
	...
]
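
Because answers and questions share the "qid" field, the two files can be joined to summarize scores per category. The sketch below assumes both files have already been parsed into lists of dictionaries; the function name mean_score_by_class and the sample records are invented for illustration and are not real evaluation data.

```python
from collections import defaultdict


def mean_score_by_class(questions, answers):
    """Average the "score" field of the answers per first-level class.

    questions: list of dicts in the question.json format
    answers:   list of dicts in the answer file format
    Answers without a score, or whose qid is unknown, are skipped.
    """
    class_of = {q["qid"]: q["first_level_class"] for q in questions}
    totals = defaultdict(lambda: [0, 0])  # class -> [sum of scores, count]
    for answer in answers:
        if "score" not in answer or answer["qid"] not in class_of:
            continue
        bucket = totals[class_of[answer["qid"]]]
        bucket[0] += answer["score"]
        bucket[1] += 1
    return {cls: total / count for cls, (total, count) in totals.items()}


# Illustrative records only.
questions = [
    {"qid": 0, "question": "q0", "first_level_class": "General Knowledge",
     "second_level_class": "Natural Sciences"},
    {"qid": 1, "question": "q1", "first_level_class": "General Knowledge",
     "second_level_class": "Technology"},
    {"qid": 2, "question": "q2", "first_level_class": "Logical Reasoning",
     "second_level_class": "Mathematics"},
]
answers = [
    {"qid": 0, "answer": "a0", "score": 8},
    {"qid": 1, "answer": "a1", "score": 6},
    {"qid": 2, "answer": "a2", "score": 9},
]

print(mean_score_by_class(questions, answers))
# {'General Knowledge': 7.0, 'Logical Reasoning': 9.0}
```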

How to Score Answers from Different LLMs:

We score answers in two ways:

Scoring with OpenAI's GPT-4: GPT-4 is currently recognized as one of the most capable large models. Therefore, we use GPT-4 to assess the question-answering performance of each model. We use the following format to pose questions to GPT-4:

You are a helpful and accurate assistant used to check the quality of the answers provided for specific questions.
[Question]: {question}
[Answer]: {answer}
We would like you to provide feedback on the performance of an AI assistant in answering the given question. The assistant should be given an overall performance score on a scale from 0 to 10, where higher scores indicate better overall performance. Please start by outputting a single line containing a numerical value representing the assistant's score. On the following lines, please provide a comprehensive explanation of your evaluation, avoiding any potential biases.

The above prompt is displayed in English, but in actual evaluations, we use Chinese prompts. The content of the Chinese and English prompts is essentially the same.

In the above prompt, {question} and {answer} correspond to the content of the question being tested and the response from the large model, respectively. Here is a specific example:

You are a helpful and accurate assistant for reviewing the quality of answers to specific questions.
[Question]: What is immunity? Please answer the following questions, ensuring the accuracy of the answers.  
[Answer]: Immunity refers to the ability of a human or animal body to respond and defend against the invasion, infection, and disease caused by foreign pathogens. Within the human body, there is a complex set of physiological mechanisms that can detect and attack potential viruses, bacteria, fungi, and other microorganisms, including the innate immune system and the adaptive immune system. Immunity can be enhanced through various methods, such as a healthy diet, good quality sleep, exercise, vaccinations, and avoiding chronic stress.
Please provide feedback on the performance of an AI assistant in answering the question above. The assistant should be given an overall performance score on a scale of 0 to 10, where higher scores indicate better overall performance. First, please output a single line containing a numerical value representing the assistant's score. On the following lines, provide a comprehensive explanation of your evaluation, avoiding any potential biases.
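
The prompt asks for the score on the first line and the explanation on the lines after it, so a reply can be split mechanically. The following is a minimal sketch under stated assumptions: the template is an abridged English rendering (the actual evaluation uses Chinese prompts), build_prompt and parse_review are hypothetical helper names, and the call to GPT-4 itself is deliberately left out.

```python
import re

# Abridged English rendering of the prompt; the real evaluation uses Chinese.
PROMPT_TEMPLATE = (
    "You are a helpful and accurate assistant used to check the quality of "
    "the answers provided for specific questions.\n"
    "[Question]: {question}\n"
    "[Answer]: {answer}\n"
    "Please start by outputting a single line containing a numerical value "
    "representing the assistant's score (0 to 10). On the following lines, "
    "please provide a comprehensive explanation of your evaluation."
)


def build_prompt(question, answer):
    """Fill the {question} and {answer} placeholders of the template."""
    return PROMPT_TEMPLATE.format(question=question, answer=answer)


def parse_review(reply):
    """Split a reviewer reply into (score, explanation).

    Assumes the first line holds only the number, as the prompt requests.
    """
    lines = reply.strip().splitlines()
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*", lines[0]) if lines else None
    if match is None:
        raise ValueError("first line of the reply is not a bare number")
    score = float(match.group(1))
    if not 0 <= score <= 10:
        raise ValueError(f"score {score} is outside the 0 to 10 scale")
    return score, "\n".join(lines[1:]).strip()


prompt = build_prompt("What is immunity?", "Immunity refers to ...")
score, explanation = parse_review("8\nThe answer is accurate but brief.")
print(score)        # 8.0
print(explanation)  # The answer is accurate but brief.
```

Rejecting replies whose first line is not a bare number keeps malformed reviews from being silently recorded as scores.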

Scoring with user feedback: We display the responses from different models beneath each question, and users can "like" or "dislike" each answer. We aggregate these votes across each model's answers to evaluate its performance. We consider this the most objective and accurate evaluation method, but it relies on a large volume of user feedback. We therefore welcome every user to rate the answers. Your feedback will influence the future development direction of AGI.
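
The exact aggregation formula is not specified here. As one possible sketch, the snippet below assumes each vote is recorded as a (model nickname, "like" or "dislike") pair and reports the share of likes per model. The record layout, the function name like_ratio, and the sample votes are all assumptions made for illustration.

```python
from collections import Counter


def like_ratio(votes):
    """Return the fraction of "like" votes per model.

    votes: iterable of (model, vote) pairs, where vote is "like" or "dislike".
    This ratio is an assumed aggregation, not necessarily the one Bulb uses.
    """
    likes, totals = Counter(), Counter()
    for model, vote in votes:
        if vote not in ("like", "dislike"):
            raise ValueError(f"unknown vote: {vote!r}")
        totals[model] += 1
        if vote == "like":
            likes[model] += 1
    return {model: likes[model] / totals[model] for model in totals}


# Invented votes, keyed by the anonymizing nicknames.
votes = [
    ("Radish", "like"),
    ("Radish", "like"),
    ("Radish", "dislike"),
    ("Radish", "like"),
    ("Eggplant", "dislike"),
    ("Eggplant", "like"),
]
print(like_ratio(votes))  # {'Radish': 0.75, 'Eggplant': 0.5}
```

A plain ratio treats a model with two votes the same as one with two thousand, which is one reason this method depends on a large volume of feedback.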

Aspects for Evaluating Large Models:

Drawing on an extensive compilation of open-source question-answering data, social media, forums, and books (such as BELLE, "Ten Thousand Whys," and "Weak IQ Baidu Tieba"), and after repeated discussion and refinement, we have designed eight major categories to explore the capabilities of large models: general knowledge, language understanding, creative writing, logical reasoning, code programming, job skills, tool usage, and personality traits. Each major category is further subdivided to cover the range of capabilities of large models as comprehensively as possible. The descriptions below list each major category as "Name: description", followed by its subcategories. We welcome input from everyone to make this classification system more comprehensive and accurate.

General Knowledge: General knowledge covers basic concepts and information from various aspects such as basic sciences, mathematics, history and culture, language and literature, social sciences, everyday life skills, and modern technological applications.
Social Sciences and Humanities	Social sciences and humanities knowledge includes multiple disciplines such as history, philosophy, language and literature, art, religion, sociology, anthropology, political science, law, and economics. It involves the study and understanding of human society, culture, values, behaviors, and ways of thinking in these fields.
Natural Sciences	Natural sciences knowledge encompasses systematic knowledge and methods from various disciplinary fields, including physics, chemistry, biology, geography, geology, and astronomy.
Everyday Life Skills	Everyday life skills knowledge contains practical information and techniques related to daily life, dietary nutrition, home safety, transportation, and more, aiming to help people live a more convenient, healthy, and safe life.
Artistic Creation	Artistic creation knowledge includes creative techniques, principles, and aesthetic concepts of various art forms such as painting, sculpture, photography, music, dance, theater, and film.
Economics	Economic knowledge includes understanding of supply and demand relationships, principles of market economy, fiscal and monetary policies, consumer and producer behavior, as well as basic macroeconomic indicators such as GDP, inflation rate, unemployment rate, etc.
Timely Information	Timely information includes events, data, trends, news, and technological updates with specific time constraints or relevance to specific time periods.
Technology	Technology knowledge includes understanding and skills in operating and applying various modern technological devices, such as computer hardware and software, internet technology, mobile communications, artificial intelligence, data analysis, machine learning, e-commerce, technological innovation, as well as related industries and regulatory standards.
Medical and Health	Medical and health knowledge covers various aspects of medical expertise and practical skills, including disease prevention, diagnosis, treatment, rehabilitation, physical and mental health, and nutrition.
Language Understanding: Language understanding covers the comprehension of various aspects such as vocabulary, grammar, syntax, semantics, rhetoric, context, and non-verbal cues, as well as how to combine them to understand and interpret the meaning and purpose of oral or written language.
Translation	Translation is the process of accurately and completely converting text or speech from one language to another, aiming to overcome language barriers and convey and communicate information, culture, and thoughts.
Grammar and Vocabulary	Grammar and vocabulary proficiency includes mastery of words, phrases, sentence structure rules in a language, and the ability to use these rules to express meaning correctly.
Summarization	Summarization ability includes extracting key information, summarizing main points, integrating core elements, and presenting content in a concise manner.
Information Extraction	Information extraction ability involves key information identification and association techniques, such as entity recognition, relationship extraction, event extraction, and attribute extraction.
Sentiment Analysis	Sentiment analysis ability includes recognizing, analyzing, and understanding emotions, attitudes, and viewpoints in text.
Topic Classification	Topic classification ability involves identifying, analyzing, and categorizing texts, articles, or other information content, enabling effective organization and retrieval based on different themes or domains.
Reading Comprehension	Reading comprehension ability includes skills such as extracting key information from text, understanding the main idea, analyzing reasoning, critical evaluation, and integrating and applying knowledge.
Creative Writing: Creative writing ability refers to the ability to generate unique and valuable new works or ideas through innovative thinking and knowledge skills, capable of automatically producing creative, coherent, and highly readable written works.
Literary Creation	Literary creation is the process of expressing thoughts, emotions, and imagination through words, creating unique artistic works such as stories, poems, and plays.
Creative Copywriting	Creative copywriting is a form of artistic creation that uses unique thinking and verbal expression to provide captivating advertising and promotional messages for brands, products, or services, attracting audience attention and stimulating desire to purchase.
Professional Documentation	Professional documentation refers to high-quality, accurate, and rigorous written materials tailored to specific fields or topics, used to convey professional knowledge, opinions, or information, such as news reports, product descriptions, work reports, etc.
Style Transfer	Text style transfer is the ability to transfer the style or semantic features of a given text to another text, such as converting narrative tone or simulating the narrative style of different authors.
Rewriting and Expansion	Rewriting and expansion involve extending the plot, creativity, or theme based on existing text, creating a longer or more complete work.
Logical Reasoning: Logical reasoning ability refers to the ability to deduce new information or conclusions in a rational and systematic manner through the analysis and understanding of known facts or information.
Mathematics	Mathematics ability refers to the ability to understand and solve mathematical problems, including performing basic mathematical operations, understanding and applying mathematical concepts, and engaging in complex mathematical reasoning and problem-solving.
Critical Thinking	Critical thinking ability refers to the model's ability to analyze, criticize, and reason when receiving information or questions, in order to generate answers or viewpoints with depth and insight. This requires the model to understand complex concepts and problems and simulate human thinking processes to some extent. Additionally, the model needs to effectively understand and identify fallacious or logically confusing questions.
Analysis	Analytical ability refers to the model's ability to deeply understand the input information, identify patterns and structures within it, process them through inference and logic, and generate insightful and structured outputs or answers. Typical analysis problems involve complex logical reasoning processes.
Code Programming: Code programming ability refers to the ability to understand, design, and implement computer programs, including familiarity with programming languages, logical thinking, problem-solving, and code organization skills.
Code Generation	Code generation refers to the process of generating program source code in a specific language, such as Python, C, Java, Go, SQL, based on a given problem description.
Code Debugging	Code debugging is the process of inspecting, identifying, and correcting errors, defects, or unexpected behavior in programming code.
Code Explanation	Code explanation involves understanding the program's source code and providing a natural language description that explains the code's functionality and operations or provides comments for code segments.
Code Optimization	Code optimization is the process of modifying and adjusting program code to improve runtime efficiency, reduce memory usage, decrease program complexity, and enhance readability.
Job Skills: Job skills refer to the knowledge, abilities, and experience required to effectively accomplish specific tasks in a professional field.
Organizational Planning	Organizational planning is the process of systematically planning and designing the goals, processes, resources, and participants of an activity or project.
Marketing Operations	Marketing operations involve targeted planning, execution, and optimization of an organization's marketing activities to enhance brand awareness, increase customer engagement, and drive sales performance.
Collaboration and Communication	Collaboration and communication are interactive ways of effectively sharing information, perspectives, and suggestions within a team to achieve common goals and problem-solving.
Design and Creativity	Design and creativity involve the process of transforming ideas, concepts, and requirements into practical visual or functional solutions through creativity, skills, and aesthetic sensibilities.
Tool Usage: Tool usage refers to the integration of large models with external tools, ensuring proper interaction through appropriate interfaces and data formats, and enabling the model to understand and utilize the functionality and outputs provided by external tools.
Search Engines	Search engine usage is the integration of large models with search engines, allowing the model to issue search queries and to parse and process the search results to obtain relevant information.
Computational Tools	Computational tool usage is the integration of large models with computational tools such as calculators and WolframAlpha, invoked through appropriate interfaces and data exchange methods to perform tasks like calculations, analysis, or simulations.
Personality Traits: Personality traits refer to the enduring and stable patterns of thinking, emotions, and behaviors of AGI, which constitute their character and influence their responses and interactions in various social environments.
Security	The answers generated by large models may pose security risks, including but not limited to leaking sensitive information, producing inaccurate or misleading information, generating inappropriate or harmful content, or being maliciously exploited for creating false information or conducting cyberattacks.
Bias	The answers generated by large models may align with biases present in their training data, which may involve biases related to race, gender, age, religion, socioeconomic status, among other aspects, and may lead to unfair or discriminatory conclusions.
Compliance	Compliance capability of a large model refers to its ability to generate outputs that meet specific requirements and objectives under user guidance and defined constraints. This involves the model's accuracy in understanding instructions and adaptability in execution to fulfill application needs in different scenarios.

Source of Evaluation Questions:

We have collected the sources of each question, as shown in the table below:

Source	BELLE eval set	Ten Thousand Whys	WikiHow	Weak IQ Baidu Tieba	Others
Amount	1000	30	20	24	17

Frequency Statistics of Each Class of Questions

We have counted the frequency of questions in each category, as shown in the table below:

First Level Class	Second Level Class	Number of questions (second level)	Number of questions (first level)
Code Programming	Code Generation	27	42
	Code Optimization	1
	Code Explanation	8
	Code Correction	6
General Knowledge	Social and Humanities	111	457
	Natural Science	155
	Medical and Health	35
	Everyday Life	75
	Timely Information	13
	Technology	40
	Economy	8
Creative Writing	Creative Copywriting	38	198
	Literary Creation	24
	Professional Documentation	44
	Rewriting and Expansion	17
	Style Transfer	75
Language Understanding	Grammar and Vocabulary	22	202
	Summarization	46
	Translation	39
	Sentiment Analysis	11
	Topic Classification	29
	Reading Comprehension	27
	Information Extraction	28
Logical Reasoning	Mathematics	74	162
	Analysis	27
	Critical Thinking	61
Job Skills	Collaboration and Communication	4	27
	Marketing Operations	11
	Organizational Planning	9
	Design and Creativity	3
Personality Traits	Bias	2	3
	Compliance	1
	Security	0
Tool Usage	Search Engines	0	0
	Computational Tools	0
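
Counts like those in the table can be recomputed from records in the question.json format. The sketch below is one way to do so, assuming the records have already been loaded into a list; the function name tally and the three sample records are illustrative only. The last lines check one first-level total against its second-level rows, using the Code Programming figures from the table.

```python
from collections import Counter


def tally(questions):
    """Count questions per (first level, second level) pair and per first level."""
    second = Counter(
        (q["first_level_class"], q["second_level_class"]) for q in questions
    )
    first = Counter()
    for (first_class, _), count in second.items():
        first[first_class] += count
    return second, first


# Illustrative records; only the two class fields matter for counting.
questions = [
    {"qid": 0, "first_level_class": "Logical Reasoning", "second_level_class": "Mathematics"},
    {"qid": 1, "first_level_class": "Logical Reasoning", "second_level_class": "Analysis"},
    {"qid": 2, "first_level_class": "Logical Reasoning", "second_level_class": "Mathematics"},
]
second, first = tally(questions)
print(second[("Logical Reasoning", "Mathematics")])  # 2
print(first["Logical Reasoning"])                    # 3

# Consistency check on one row group of the table: the second-level counts
# for Code Programming should add up to its first-level total of 42.
code_rows = {
    "Code Generation": 27,
    "Code Optimization": 1,
    "Code Explanation": 8,
    "Code Correction": 6,
}
print(sum(code_rows.values()))  # 42
```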
