Paired Open-Ended Trailblazer (POET): Endlessly Generating Increasingly Complex and Diverse Learning Environments and Their Solutions

2025-09-05
Published: 2019-01-07
Authors: Rui Wang, Joel Lehman, Jeff Clune, Kenneth O. Stanley

Abstract

While the history of machine learning so far largely encompasses a series of problems posed by researchers and algorithms that learn their solutions, an important question is whether the problems themselves can be generated by the algorithm at the same time as they are being solved. Such a process would in effect build its own diverse and expanding curricula, and the solutions to problems at various stages would become stepping stones towards solving even more challenging problems later in the process. The Paired Open-Ended Trailblazer (POET) algorithm introduced in this paper does just that: it pairs the generation of environmental challenges and the optimization of agents to solve those challenges. It simultaneously explores many different paths through the space of possible problems and solutions and, critically, allows these stepping-stone solutions to transfer between problems if better, catalyzing innovation. The term open-ended signifies the intriguing potential for algorithms like POET to continue to create novel and increasingly complex capabilities without bound. Our results show that POET produces a diverse range of sophisticated behaviors that solve a wide range of environmental challenges, many of which cannot be solved by direct optimization alone, or even through a direct-path curriculum-building control algorithm introduced to highlight the critical role of open-endedness in solving ambitious challenges. The ability to transfer solutions from one environment to another proves essential to unlocking the full potential of the system as a whole, demonstrating the unpredictable nature of fortuitous stepping stones. We hope that POET will inspire a new push towards open-ended discovery across many domains, where algorithms like POET can blaze a trail through their interesting possible manifestations and solutions.

arxiv.org/abs/1901.01753

Introduction

Problems that are just hard enough, but not too hard:

… the job of the algorithm could be to conceive both the challenges and the solutions at the same time. Such a process offers the novel possibility that the march of progress, guided so far by a sequence of problems conceived by humans, could lead itself forward, pushing the boundaries of performance autonomously and indefinitely. In effect, such an algorithm could continually invent new environments that pose novel problems just hard enough to challenge current capabilities, but not so hard that all gradient is lost.

Biomimicry in a broad sense:

In fact, the only process ever actually to achieve intelligence at the human level, natural evolution, is just such a self-contained and open-ended curriculum-generating process. Both the problems of life, such as reaching and eating the leaves of trees for nutrition, and the solutions, such as giraffes, are the products of the same open-ended process. And this process unfolds not as a single linear progression, but rather involves innumerable parallel and interacting branches radiating for more than a billion years (and is still going).

Transfers among different environments:

A key opportunity afforded by this approach is to attempt transfers among different environments. That is, the solution to one environment might be a stepping stone to a new level of performance in another, which reflects our uncertainty about the stepping stones that trace the ideal curriculum to any given skill.

Background

(omitted)

The Paired Open-Ended Trailblazer (POET) Algorithm

Trailblazer:

It also elaborates on the minimal criterion in MCC by aiming to maintain only those newly-generated environments that are not too hard and not too easy for the current population of agents. The result is a trailblazer algorithm, one that continually forges new paths to both increasing challenges and skills within a single run.

The fundamental algorithm:

The fundamental algorithm of POET is simple: The idea is to maintain a list of active environment-agent pairs EA_List that begins with a single starting pair $(E^\text{init}(\cdot), \theta^\text{init})$, where $E^\text{init}$ is a simple environment (e.g. an obstacle course of entirely flat ground) and $\theta^\text{init}$ is a randomly initialized weight vector (e.g. for a neural network). POET then has three main tasks that it performs at each iteration of its main loop:

generating new environments $E(\cdot)$ from those currently active,

optimizing paired agents within their respective environments, and

attempting to transfer current agents $\theta$ from one environment to another.
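The three tasks above can be sketched as a toy main loop. This is only an illustration under stated assumptions, not the paper's implementation: environments and agents are plain floats, the parameters `mutation_interval`, `transfer_interval`, and `max_pairs` are hypothetical, a new environment simply inherits its parent's agent, and the "not too hard, not too easy" filter is left out for brevity.

```python
import random

def poet_loop(init_env, init_agent, mutate_env, optimize_step, evaluate,
              iterations=20, mutation_interval=5, transfer_interval=5,
              max_pairs=4, rng=None):
    """Toy skeleton of the three tasks; every parameter name is an assumption."""
    rng = rng or random.Random(0)
    ea_list = [(init_env, init_agent)]  # active environment-agent pairs
    for t in range(1, iterations + 1):
        # Task 1: generate a new environment from a currently active one.
        # Assumption: the child environment starts with its parent's agent.
        if t % mutation_interval == 0 and len(ea_list) < max_pairs:
            parent_env, parent_agent = rng.choice(ea_list)
            ea_list.append((mutate_env(parent_env, rng), parent_agent))
        # Task 2: optimize each paired agent within its own environment.
        ea_list = [(env, optimize_step(env, agent)) for env, agent in ea_list]
        # Task 3: attempt transfers; each environment keeps whichever
        # current agent scores best on it (its own agent included).
        if t % transfer_interval == 0:
            agents = [agent for _, agent in ea_list]
            ea_list = [(env, max(agents, key=lambda a, e=env: evaluate(e, a)))
                       for env, _ in ea_list]
    return ea_list

# Hypothetical toy domain: an environment is a target value, an agent is a
# guess, and the score is the negative distance between the two.
def evaluate(env, agent):
    return -abs(env - agent)

def optimize_step(env, agent, lr=0.5):
    return agent + lr * (env - agent)  # move the guess toward the target

def mutate_env(env, rng):
    return env + rng.uniform(0.5, 1.5)  # random perturbation of the encoding

pairs = poet_loop(0.0, 0.0, mutate_env, optimize_step, evaluate)
```

In this toy setting the list grows to `max_pairs` environments, each paired with the agent that currently scores best on it.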

Generating new environments is how POET continues to produce new challenges. To generate a new environment, POET simply mutates (i.e. randomly perturbs) the encoding (i.e. the parameter vector) of an active environment. However, while it is easy to generate perturbations of existing environments, the delicate part is to ensure both that (1) the paired agents in the originating (parent) environments have exhibited sufficient progress to suggest that reproducing their respective environments would not be a waste of effort, and (2) when new environments are generated, they are not added to the current population of environments unless they are neither too hard nor too easy for the current population. Furthermore, priority is given to those candidate environments that are most novel, which produces a force for diversification that encourages many different kinds of problems to be solved in a single run.
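A minimal sketch of the environment-generation step just described, under loud assumptions: environments are scalar encodings, "sufficient progress" is modeled as a score threshold (`progress_threshold`), the minimal criterion is a score window (`mc_low`, `mc_high`) checked against the best current agent, and novelty is the mean distance to the nearest active encodings. All of these names and values are hypothetical choices for illustration.

```python
import random

def novelty(env, archive, k=3):
    """Mean distance from env to its k nearest encodings in archive."""
    nearest = sorted(abs(env - other) for other in archive)[:k]
    return sum(nearest) / len(nearest) if nearest else float("inf")

def propose_environments(ea_list, evaluate, mutate_env, rng,
                         progress_threshold=-0.2, mc_low=-2.0, mc_high=-0.1,
                         children_per_parent=4, max_admitted=2):
    # (1) Only environments whose paired agent has progressed far enough
    #     are allowed to reproduce.
    parents = [env for env, agent in ea_list
               if evaluate(env, agent) >= progress_threshold]
    # Mutate (randomly perturb) the encoding of each eligible parent.
    children = [mutate_env(env, rng)
                for env in parents for _ in range(children_per_parent)]

    # (2) Minimal criterion: keep a child only if the best current agent
    #     finds it neither too hard (score < mc_low) nor too easy (> mc_high).
    def best_score(env):
        return max(evaluate(env, agent) for _, agent in ea_list)

    admitted = [env for env in children
                if mc_low <= best_score(env) <= mc_high]
    # Priority goes to the most novel candidates.
    archive = [env for env, _ in ea_list]
    admitted.sort(key=lambda env: novelty(env, archive), reverse=True)
    return admitted[:max_admitted]

# Hypothetical toy domain: score is the negative distance to the target.
def evaluate(env, agent):
    return -abs(env - agent)

def mutate_env(env, rng):
    return env + rng.uniform(-3.0, 3.0)

active = [(0.0, 0.0), (1.0, 0.2)]
new_envs = propose_environments(active, evaluate, mutate_env, random.Random(1))
```

Only the first pair is eligible to reproduce here, because the second agent's score of -0.8 is below the assumed progress threshold.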

Experiment Setup And Results

The “Curriculum builder” Interpretation:

One interpretation of POET’s ability to create agents that can solve challenging problems is that it is in effect an automatic curriculum builder. Building a proper curriculum is critical for learning to master tasks that are challenging to learn from scratch due to a lack of informative gradient. However, building an effective curriculum given a target task is itself often a major challenge. Because newer environments in POET are created through mutations of older environments and because POET only accepts new environments that are neither too easy nor too hard for current agents, POET implicitly builds a curriculum for learning each environment it creates. The overall effect is that it is building many overlapping curricula simultaneously, and continually checking whether skills learned in one branch might transfer to another.

“Direct-path curriculum-building” as a control algorithm:

A natural question then is whether the environments created and solved by POET can also be solved by an explicit, direct-path curriculum-building control algorithm. … In this control, the agent is progressively trained on a sequence of environments of increasing difficulty that move towards the target environment. …

The results … demonstrate that POET creates and solves environments that the control algorithm fails to solve at very and extremely challenging difficulty levels, while the curriculum-based control algorithm can often (though not always) solve environments at the lowest challenge level.
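For contrast, the direct-path control can be pictured as a single agent walking one straight line of environments toward the target. The sketch below is an assumption-heavy illustration: the excerpt does not say how intermediate environments are chosen, so linear interpolation between encodings, the `stages` and `steps_per_stage` budgets, and the `solved_score` threshold are all hypothetical.

```python
def direct_path_curriculum(start_env, target_env, agent, optimize_step,
                           evaluate, stages=4, steps_per_stage=50,
                           solved_score=-0.05):
    """Train one agent on a single sequence of environments that moves
    from start_env toward target_env. Returns (agent, reached_target)."""
    for stage in range(1, stages + 1):
        frac = stage / stages
        # Assumption: intermediate environments lie on a straight line
        # between the start and target encodings.
        env = [s + frac * (t - s) for s, t in zip(start_env, target_env)]
        for _ in range(steps_per_stage):
            if evaluate(env, agent) >= solved_score:
                break
            agent = optimize_step(env, agent)
        if evaluate(env, agent) < solved_score:
            return agent, False  # stuck: no other branch to borrow from
    return agent, True

# Hypothetical toy domain: environments and agents are vectors, and the
# score is the negative largest coordinate-wise gap.
def evaluate(env, agent):
    return -max(abs(e - a) for e, a in zip(env, agent))

def optimize_step(env, agent, lr=0.5):
    return [a + lr * (e - a) for e, a in zip(env, agent)]

agent, reached = direct_path_curriculum([0.0, 0.0], [4.0, 2.0], [0.0, 0.0],
                                        optimize_step, evaluate)
```

The toy problem is easy enough for the single path to succeed. The point of the sketch is structural: when a stage fails, this control has no alternative branch or transferred agent to fall back on.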

Lack of necessary stepping stones:

A fundamental problem of a pre-conceived direct-path curriculum (like the control algorithm above) is the potential lack of necessary stepping stones. In particular, skills learned in one environment can be useful and critical for learning in another environment. Because there is no way to predict where and when stepping stones emerge, the need arises to conduct transfer experiments (which POET implements) from differing environments or problems. (Note by ak: there are kinds of problems where pieces of the solution are discovered only while going through the "process of solving." The value lies in the process.)
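A transfer experiment of the kind mentioned above might look like the following sketch. It is one plausible variant and not necessarily what the paper does: each environment tries every other environment's agent, both as-is and after a short fine-tuning trial (`finetune_steps`, a hypothetical parameter), and adopts a candidate only if it strictly beats the incumbent. The returned log records which environment supplied the stepping stone.

```python
def attempt_transfers(ea_list, evaluate, optimize_step, finetune_steps=1):
    """Return (new_ea_list, log), where log holds (source, target) indices
    for every environment that adopted an agent from another environment."""
    new_list, log = [], []
    for i, (env, incumbent) in enumerate(ea_list):
        best_agent, best_score, source = incumbent, evaluate(env, incumbent), i
        for j, (_, candidate) in enumerate(ea_list):
            if j == i:
                continue
            # Assumption: also try the candidate after brief fine-tuning
            # in the target environment.
            tuned = candidate
            for _ in range(finetune_steps):
                tuned = optimize_step(env, tuned)
            for trial in (candidate, tuned):
                score = evaluate(env, trial)
                if score > best_score:  # replace only on strict improvement
                    best_agent, best_score, source = trial, score, j
        new_list.append((env, best_agent))
        if source != i:
            log.append((source, i))
    return new_list, log

# Hypothetical toy domain: score is the negative distance to the target.
def evaluate(env, agent):
    return -abs(env - agent)

def optimize_step(env, agent, lr=0.5):
    return agent + lr * (env - agent)

pairs = [(0.0, 0.0), (1.0, 0.9), (1.2, 0.0)]
new_pairs, log = attempt_transfers(pairs, evaluate, optimize_step)
```

In this toy run the third environment, whose own agent has made no progress, adopts the agent trained in the second environment. That is a small instance of a stepping stone arriving from a different problem.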

Discussion, Future Work, and Conclusion

POET as a Meta-learning algorithm:

POET could also substantially drive progress in the field of meta-learning, wherein neural networks are exposed to many different problems and get better over time at learning how to solve new challenges (i.e. they learn to learn).

2025-09-05 Notes

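Lesson? Solving a hard problem requires intermediate problems (stepping stones), but which intermediate problems need to be solved cannot be predicted or enumerated in advance (this is the decisive reason the control algorithm in the paper performs poorly. The value embedded in the process?). It is therefore important to discover a variety of intermediate problems while working toward the solution, and to transfer what is learned from them.
In the paper there is a curriculum builder (an algorithm that starts from E-init and progressively unfolds the next E of appropriate difficulty) and there are agents that learn in each of these E. Merging the two (that is, letting the agent construct the learning environment it needs by itself) would also connect to the notion of information self-structuring that Andy Clark discusses. That would require adding a mechanism by which the agent can exert force to modify an already generated environment (moving obstacles, digging into the ground, and so on).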