# Description of ACS2¶

## History¶

### ACS (1997)¶

First ALCS (ACS) was developed by Stolzman in 1997 (with puts an additional anticipatory part in each classifier). It is a new kind of classifier system that learn by using the cognitive mechanism of anticipatory behavioral control (introduced in cognitive psychology by Hoffman).

Anticipatory Behavioral Control

In 1992 Hoffman formulated a learning mechanism with basic assumption that a decisive factor for purposive behavior is the anticipation of the behavior consequences. Behavior consequences usually depend on the situation in which the behavior is executed. So it is necessary to learn in which situation S which behavior R (reaction) leads to which effects E.

### CXCS (2000)¶

Tomlison and Bull (2000) published a cooperate learning classifier system (CXCS) in which cooperations between rules allow anticipatory proceses.

### YACS (2001)¶

Another ALCS with an explicit anticipatory part YACS, has been published in Gerard and Siguad (2001).

### ACS2¶

ACS2 (derived from ACS) is intendet to create a solution that is complete, accurate and maximally general. Major differences between ACS and ACS2: - ACS2 evolves explicit rules for situation-action tuples in which no changes occurs (a pass-through-symbol in E part requires a change in value), - ACS2’s ALP (specialization pressure) and GA (generalization pressure) processes are improved, - starts with initially empty population of classifiers, - modifications are made on the whole action set (not just on the executed classifier),

## Knowledge representation¶

Knowledge in an ACS2 is represented by a population of classifiers. Each classifier represents a condition-action-effect that anticipates the model state resulting from the execution of the action given the specified conditions.

A classifier in ACS2 always specifes a complete resulting state.

It consists of the following main components: - condition part (C) - specifies the set of situations in which a classifier is applicable, - action part (A) - proposes an available action, - effect part (E) - anticipates the effects of the proposed action in the specific conditions, - quality (q) - measures the accuracy of the anticipated results, $$q \in [0,1]$$, - reward prediction (r) - estimates the reward encountered after the execution of action A in condition C, $$r \in \!R$$ - immediate reward prediction (ir) - estimates the direct reinforcement encountered after execution of action A in condition C, $$ir \in \!R$$

Classifier fitness score is calculated using quality and reward - $$fitness(cl) = cl.q \cdot cl.r$$

The condition and effect part consist of the values perceived from the environment and # symbols (i.e. $$C,E \in \{ l_1, l_2, \dots, l_m, \# \}^L$$.

A #-symbol in the: - condition part is called “don’t care” and denotes that the classifier matches any value in this attribute, - effect part is called “pass-through” specifies that the classifier anticipates that the value of this attribute will not change after the execution of the specified action

Non pass-through symbols in E anticipate the change of the particular attribute to the specified value (in contrast to ACS in which a non pass-through symbol did not require a change in value).

Additionally each classifier compromises: - Mark (M) - records the values of each attribute of all situations in which the classifer did not anticipate correctly sometimes, - GA timestamp (:math:t_{ga}) - timestamp when GA was last applied, - ALP timestamp (:math:t_{alp}) - timestamp when ALP was last applied, - application average (aav) - estimates the frequency a classifier is updated (i.e. part of an action set), - experience counter (exp) - counts the number of applications, - numerosity (num) - denotes the number of micro-classifers this macroclassifier represents (one classifier may represent many identical micro-classifier)

## Agent interaction¶

ACS2 interacts autonomously with an environment.

In a behavioral act at a certain time $$t$$, the agent perceives a situation $$\sigma(t) = \{ l_1, l_2, \dots, l_m \}^L$$, where:

• $$m$$ denotes the number of possible values of each environmental attribute (or feature),

• $$l_1, l_2, \dots, l_n$$ denote the different possible values for each attribute,

• $$L$$ denotes the string length.

Note that each attribute can only take discrete values.

The system can act upon the environment with an action $$\alpha(t) = \{ \alpha_1, \alpha_2, \dots, \alpha_n \}$$, where:

• $$n$$ specifies the number of different possible actions in in the environment,
• $$\alpha_1, \alpha_2, \dots, \alpha_n$$ denote the different possible actions

After the execution of an action, the environment provides a scalar reinforcement value $$\rho(t) \in \mathbb{R}$$

## Environmental Model¶

By interacting with the environemnt the ACS2 learns about it’s structure. Usually the agent starts without any prior knowledge. Initially new classifiers are mainly generated by a covering mechanizm in ALP. Later the ALP generates specialized clasifiers while the GG tries to introduce some genetic generalization.

Figure below presents the interaction with greater details. 1. After the perception of the current situation $$\sigma(t)$$, ACS2 forms a match set [M] comprising all classifiers in the population [P] whose conditions are satisfied in $$\sigma(t)$$, 2. ACS2 chooses an action $$\alpha(t)$$ according to some strategy (see below), 3. With respect to the chosen action, an action set [A] is generated that consist of all classifiers in [M] that specify the chosen action $$\alpha(t)$$, 4. After the execution of $$\alpha(t)$$ classifier parameters are updated by ALP and RL. New classifiers might be added or deleted due to the ALP and GG (see below).

Action selection plays an important role in building the environmental model. Aproaches for dealing with exploration/explotation trade-off can be used. It’s worth mentioning that different approach should be taken for either stationary and non-stationary environments.

Literature: - ”The exploration-exploitation dilemma for adaptive agents” <http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.60.5740&rep=rep1&type=pdf>__ ~ Rejeb (exploration/exploitation dillema in XCS) - “Biasing Exploration in an Anticipatory Classifier System” ~ Butz (action delay bias, knowledge array bias) - “Action-Planning in Anticipatory Classifier Systems” ~ Butz, Stolzmann (action planning) - “Should I stay or should I go? How the human brain manages the trade-off between exploitation and exploration <http://rstb.royalsocietypublishing.org/content/362/1481/933.short>__” (discussing the aspect of making decisons what examined in depth, gittins index, soft-max) - “ACS2 Model Exploitation” ~ Butz (all techniques above)

Fact: Humans show a greater tendency to explore when there is more time left in a task

Aim is to explore unknown regions in a more directed way. In each time step the exploration can be executed with $$\epsilon$$ probability:

• random (default), > randomly select an action (might be the best or totally bad one),
• action delay bias, > the execution of actions which have been executed quite long ago (according to $$t_{alp}$$) in a given situation promises the detection of new situation-action-effect relations.
• knowledge array bias > calculates the averaged quality for each action from the match-set and selectes the worst one.
• action delay & knowledge array bias > action is chosen randomly using two approaches
• gittins index (NYI) > measure the reward that can be obtained taking into consideration the terminating condition (used in bandit problems). Restricted only to stationary problems. More info here
• soft-max (NYI) > the drawback of random action selection is that it chooses equally amongs all possible actions. Soft-max enables the possibility to vary action probabilities as a graded function of estimated value. More information about this probabilistic approach here
• expected and unexpected uncertanity (NYI) > biological approach by analysing ACh levels and NE signals
• best (default). > selects a classifier with highest fitness value - $$q \cdot r$$ from the match set [M])
• action planning > Knows where the goal is and tries to find the shortes paths with known classifiers. It’s also a good idea to explore unknown paths (not only this to goal).
• other techniques might use ir value (currently not used)

### Anticipatory Learning Process (ALP)¶

The ALP was originally derived from the cognitive theory of anticipatory behavioral control (Hoffman, 1993).

It compares the anticipation of each classifier in an action set with the real next situation $$\sigma(t+1)$$.

The process results in the evaluation and specialization of the anticipatory model in ACS2.

ALP first update classifier parameters. Next an offspring is generated and inaccuate classifiers are removed.

The quality $$q$$ of a generated classifier is set to the parent’s value but never lower than 0.5 since the new classifier is supposedly better than the parental classifier. The reward prediction $$r$$ inherits the value of its parent.

The execution of an action is accompanied by the formation of the action set, which represents the anticipations of the real next situation. Thus, ACS2 satisfies the first point of Hoffmann’s theory of anticipatory behavioral control which states that any behavioral act or response (R) is accompanied with an anticipation of its effects. Moreover, the comparison of the anticipation of each classifier in [A] can be compared to a continuous comparison of anticipations with real next situations as stated in Hoffmann’s second point. The third and fourth point address the consequences of the comparison and are realized in the distinction of an unexpected case and an expected case.

The following parameters are updated in the following order - quality ($$q$$), mark ($$M$$), application average ($$aav$$), ALP timestamp ($$t_{alp}$$) and the experience counter ($$exp$$).

##### Quality¶

The quality $$q$$ is updated according to the classifiers anticipation. If the classifier predicted correctly, the quality is increased using the following (Widrow-Hoff delta rule) formula:

$q \leftarrow q + \beta (1- q)$

Otherwise it is decreased:

\begin{align}\begin{aligned}q \leftarrow q - \beta q\\In the equation, :math:\beta \in [0,1] denotes the **learning rate** of ACS2. The smaller the learning rate, the more passive ACS2 updates its values and the less are the values biased towards recent environmental interactions. On the other hand, the larger the learning rate, the faster the parameters adapt to changes in the environment but also the more noisy are the parameters.\end{aligned}\end{align}
##### Mark¶

Situation $$\sigma(t) = (\sigma_1, \dots, \sigma_L)$$ is added to the mark $$M = (m_i, \dots, m_L)$$ if the classifier did not anticipate correctly. In this case $:nbsphinx-math:forall_i m_i = m_i \cup { :nbsphinx-math:sigma_i}$.

##### Application average¶

Parameter is updated using the “moyenne adaptive modifiee” technique as introduced in Venturini (1994).

\begin{align}\begin{aligned}\begin{split}cl.aav \leftarrow \left\{ \begin{array}{ll} \frac{cl.aav(cl.exp -1) + (t - cl.t_{alp})}{cl.exp} & \text{if}\ cl.exp < \frac{1}{\beta}\\ cl.aav + \beta (t - cl.t_{alp}) & \text{otherwise}\\ \end{array} \right.\end{split}\\ The technique assures the fast adaptation of :math:aav once the classifier is introduced and later assures a continues update according to the overall learning rate :math:\beta. Note, this technique also introduces a possible high factor of noise in the young classifier. Thus, the technique is not applied in the quality updates.\end{aligned}\end{align}
##### ALP timestamp¶

The ALP time stamp is set to the current time t recording the last parameter update in the ALP.

$cl.t_{alp} \leftarrow t$
##### Experience¶

Increment experience by 1

$cl.exp \leftarrow cl.exp + 1$

#### Classifier generation and deletion¶

The ALP generates specialized offspring and/or deletes inaccurate classifiers.

Inaccurate classifiers are classifiers whose quality is lower than the inaccuracy threshold $$\theta_i$$. When the quality of a classifiers falls below $$\theta_i$$ after an update, it is deleted.

More specialized classifiers are generated in two ways.

1. An expected case, in which a classifier anticipated the correct outcome, a classifier might be generated if the mark $$M$$ differs from the situation $$\sigma(t)$$ in some attributes, i.e. $$\exists_{i,j} l_j \in m_i \wedge l_j \neq \sigma_i$$. Since the mark specifies the characteristics of situations in which a classifier did not work correctly, a difference indicates that the specific position might be important to distinguish the correct and wrong outcome case. Thus, the ALP generates an offspring whose conditions are further specialized. If there are unique differences in the mark compared to the current situation, i.e. $$\exists_i \sigma_i \notin m_i$$, then one of the unique difference is specialized in the offspring. However, if there are only positions that differ but $$\sigma_i$$ is always part of $$m_i$$, i.e. $$\forall_i \sigma_i \in m_i$$, then all differing positions are specialized. > The number of specialized positions in the conditions that are not specialized in the effects is limited to $$u_{max}$$.

In other words: - the strength of the classifier is increased, - if there is a mark, then a new classifier is formed that is more specific in the C and E parts, trying to exclude the possibility to be chosen in the state(s) described by mark

1. In an unexpected case, a classifier did not anticipate the correct outcome (one or more predicted changes are incorrect or when one or more components change that were predicted to stay the same).

In this case a classifier: - is marked by the situation $$\sigma(t)$$, - has decreased quality: $$cl.q \leftarrow cl.q - \beta \cdot cl.q$$

In difference with the ACS all classifiers from the action set will become marked in this case.

Also, an offspring classifier may be generated, if the effect part of the old classifier can be further specialized (by changing pass-through symbols to specific values) to specify the perceived outcome correctly. All positions in condition and effect part are specialized that change from $$\sigma(t)$$ to $$\sigma(t+1)$$.

Therefore attributes whose value changed in the environment but are anticipated to stay the same are specified in condition and effect part of the offspring.

In other words: - the classifier gets a mark, - the strength of the classifier is decreased, - if possible, a new classifier is generated that is more specific than the old one and that anticipates the environment correctly.

For both (expected and unexpected) reproduction cases: - the Mark :math:M of the offspring is emptied - :math:cl.M \leftarrow \emptyset, - the experience counter :math:exp is set to zero - :math:cl.exp \leftarrow 0, - ALP and GA time stamp are set to the current time t - :math:cl.t_{alp}, cl.t_{ga} \leftarrow t, - the numerosity is set to one - :math:cl.num \leftarrow 0, - all other parameters are inherited from the parents. If the generated offspring already exists in the population [P], the offspring is discarded and the quality q of the old classifier is increased applying equation 1.

A classifier is also generated if there was no other classifier in the actual action set [A] that anticipated the effect correctly. In this case, a covering classifier is generated that is specialized in all attributes in condition and effect part that changed from $$\sigma(t)$$ to $$\sigma(t+1)$$. > The covering method was not applied in ACS since in ACS a completely general classifiers was always present for each action.

The attributes of the Mark $$M$$ of the covering classifier are initially empty. Quality $$q$$ is set to 0.5 as well as the reward prediction $$r$$, while the immediate reward prediction $$ir$$ as well as the application average $$avv$$ are set to 0. The time stamps are set to the current time $$t$$.

\begin{align}\begin{aligned}\begin{split}cl.M \leftarrow \emptyset \\ cl.q \leftarrow 0.5 \\ cl.r \leftarrow 0.5 \\ cl.ir \leftarrow 0 \\ cl.aav \leftarrow 0 \\ cl.t_{alp} \leftarrow t \\ cl.t_{ga} \leftarrow t\end{split}\\ It is important to mention that both adding and removing classifiers **does not influence the ongoing ALP application**.\end{aligned}\end{align}

### Genetic Generalization (GG)¶

While the ALP specializes classifiers in a quite competent way, over-specializations can occur sometimes as studied in (Butz, 2001). Since the over-specialization cases can be caused by various circumstances, a genetic generalization (GG) mechanism was applied that, interacting with the ALP, results in the evolution of a complete, accurate, and maximally general model.

The mechanism starts after applying ALP module, and looks as follow:

1. Determine if GG should be applied. GG is applied if the average time since last GG application in the current action set [A] is larger than the threshold $$\theta_{ga}$$.

$t - \frac{\sum_{cl \in [A]} cl.t_{ga} cl.num}{\sum_{cl \in [A]} cl.num} > \theta_{ga}$
2. If the mechanism is applied update the application time for all classifiers $$\forall_{cl \in [A]} cl.t_{ga} \leftarrow t$$,

3. Select two classifiers using “roulette-wheel selection”, with respect to their qualities $$q$$.

4. Reproduce two classifiers, by removing marks and halving the qualities.

$\begin{split}cl_1.M \leftarrow \emptyset \\ cl_2.M \leftarrow \emptyset \\ cl_1.q \leftarrow \frac{cl_1.q}{2} \\ cl_2.q \leftarrow \frac{cl_2.q}{2}\end{split}$
5. Mutate classifiers, by applying a “generalizing mutation”. >Generalizing mutation is only mutating specified attributes in the condition part C back to don’t care # symbols. A specialized attribute is generalized with a probability $$\mu$$. Moreover, conditions of the offspring are crossed applying two-point crossover with a probability of $$\chi$$. In the case of a crossover application, quality, reward prediction, and immediate reward prediction are averaged over the offspring.

6. Insert new offspring classifiers into [P] and [A] > If a generated offspring already exists in the population, the offspring classifier is discarded and if the existing classifier is not marked its numerosity is increased by one.

$\begin{split}cl.M \leftarrow \emptyset \\ cl.num \leftarrow cl.num + 1\end{split}$

The GG mechanism also applies a deletion procedure inside the action set. If an action set [A] exceeds the action set size threshold $$\theta_{as}$$, excess classifiers are deleted in [A]. The procedure applies a modified “tournament selection” process in which the classifier with the significant lowest quality, or the classifier with the highest specificity is deleted. Thus, deletion causes the extinction of low-quality as well as over-specialized classifiers.

Aproximatelly a third of the action set size takes part in the tournament.

### Reinforcement Learning (RL)¶

RL approach in ACS2 adapts the Q-learning (Watkins, 1989; Watkins & Dayan, 1992) idea (away from the traditional bucket brigade algorithm).

An environmental model needs to be specific enough to represent a proper behavioral policy (rewards) in the model. If this is not the case, “model aliasing” might take place.

In “model aliasing” a model is completely accurate in terms of anticipations but over-general with respect to reinforcement.

In order to learn an optimal behavioral policy in ACS2, parameters $$r$$ and $$ir$$ are continuously updated after executing an action and obtaining next environmental perception and reward - $$\rho(t)$$.

$\begin{split}r \leftarrow r + \beta (\rho(t) + \gamma \max_{cl \in [M](t+1) \wedge cl.E \neq \{\#\}^L} (cl.q \cdot cl.r) - r) \\ ir \leftarrow ir + \beta (\rho(t) - ir)\end{split}$

As before $$\beta \in [0,1]$$ denotes the “learning rate” biasing the parameters more or less towards recently encountered reward. $$\gamma \in [0,1)$$ denotes the “discount factor”.

The values of $$r$$ and $$ir$$ consequently specify an average of the resulting reward after the execution of action A over all possible situations of the environment in which the classifier is applicable.

### Other aspects¶

Subsumption

If an offspring classifier was generated (regardless if by ALP or GG), the set is searched for a subsuming classifier.

The offspring is subsumed if a classifier: - has more general condition part, - has identical effect part, - is reliable (its quality is higher than the threshold $$\theta_r$$), - is not marked, - is experienced (its experience counter $$exp$$ is higher than the threshold $$\theta_{exp}$$).

If there are more than one possible subsumer, the syntactically maximally general subsumer is chosen. In the case of a draw, the subsumer is chosen at random from the maximally general ones. If a subsumer was found, the offspring is discarded and either quality or numerosity is increased dependent on if the offspring was generated by ALP or GG, respectively.

Interaction of ALP and GG.

Several distinct studies in various environments revealed that the interaction of ALP and GG is able to evolve a complete, accurate, and maximally general model in various environments in a competent way (see e.g. Butz, Goldberg, and Stolzmann, 2000, Butz, 2001). The basic idea behind the interacting model learning processes is that the specialization process extracts as much information as possible from the encountered environment continuously specializing over-general classifiers. The GG mechanism, on the other hand, randomly generalizes exploiting the power of a genetic algorithm where no more additional information is available from the environment. The ALP ensures diversity and prevents the loss of information of a particular niche in the environment. Only GG generates identical classifiers and causes convergence in the population.

### Sources¶

• “Anticipatory Classifier Systems” - Wolfgang Stolzmann
• “Biasing Exploration in an Anticipatory Learning Classifier System” - Martin V. Butz