Feature Reinforcement Learning: Part I. Unstructured MDPs

General-purpose, intelligent, learning agents cycle through sequences of observations, actions, and rewards that are complex, uncertain, unknown, and non-Markovian. On the other hand, reinforcement learning is well-developed for small finite state Markov decision processes (MDPs). Up to now, extracting the right state representations out of bare observations, that is, reducing the general agent setup to the MDP framework, has been an art that requires significant effort by designers. The primary goal of this work is to automate the reduction process and thereby significantly expand the scope of many existing reinforcement learning algorithms and the agents that employ them. Before we can think of mechanizing this search for suitable MDPs, we need a formal objective criterion. The main contribution of this article is to develop such a criterion. I also integrate the various parts into one learning algorithm. Extensions to more realistic dynamic Bayesian networks are developed in Part II (Hutter, 2009c). The role of POMDPs is also considered there.

  • Aarts, E. H. L., and Lenstra, J. K., eds. 1997. Local Search in Combinatorial Optimization. Discrete Mathematics and Optimization. Chichester, England: Wiley-Interscience.

  • Banzhaf, W.; Nordin, P.; Keller, E.; and Francone, F. 1998. Genetic Programming. San Francisco, CA, U.S.A.: Morgan Kaufmann.

  • Barron, A. R. 1985. Logically Smooth Density Estimation. Ph.D. Dissertation, Stanford University.

  • Berry, D. A., and Fristedt, B. 1985. Bandit Problems: Sequential Allocation of Experiments. London: Chapman and Hall.

  • Brafman, R. I., and Tennenholtz, M. 2002. R-max - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning. Journal of Machine Learning Research 3:213-231.

  • Cover, T. M., and Thomas, J. A. 2006. Elements of Information Theory. Wiley-Interscience, 2nd edition.

  • Dearden, R.; Friedman, N.; and Andre, D. 1999. Model based Bayesian Exploration. In Proc. 15th Conference on Uncertainty in Artificial Intelligence (UAI-99), 150-159.

  • Duff, M. 2002. Optimal Learning: Computational procedures for Bayes-adaptive Markov decision processes. Ph.D. Dissertation, Department of Computer Science, University of Massachusetts Amherst.

  • Dzeroski, S.; de Raedt, L.; and Driessens, K. 2001. Relational Reinforcement Learning. Machine Learning 43:7-52.

  • Fishman, G. 2003. Monte Carlo. Springer.

  • Givan, R.; Dean, T.; and Greig, M. 2003. Equivalence Notions and Model Minimization in Markov Decision Processes. Artificial Intelligence 147(1-2):163-223.

  • Goertzel, B., and Pennachin, C., eds. 2007. Artificial General Intelligence. Springer.

  • Gordon, G. 1999. Approximate Solutions to Markov Decision Processes. Ph.D. Dissertation, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.

  • Grünwald, P. D. 2007. The Minimum Description Length Principle. Cambridge: The MIT Press.

  • Guyon, I., and Elisseeff, A., eds. 2003. Variable and Feature Selection. JMLR Special Issue: MIT Press.

  • Hastie, T.; Tibshirani, R.; and Friedman, J. H. 2001. The Elements of Statistical Learning. Springer.

  • Hutter, M. 2005. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Berlin: Springer. 300 pages, http://www.hutter1.net/ai/uaibook.htm.

  • Hutter, M. 2007. Universal Algorithmic Intelligence: A Mathematical Top→Down Approach. In Artificial General Intelligence. Berlin: Springer. 227-290.

  • Hutter, M. 2009a. Feature Dynamic Bayesian Networks. In Proc. 2nd Conf. on Artificial General Intelligence (AGI'09), volume 8, 67-73. Atlantis Press.

  • Hutter, M. 2009b. Feature Markov Decision Processes. In Proc. 2nd Conf. on Artificial General Intelligence (AGI'09), volume 8, 61-66. Atlantis Press.

  • Hutter, M. 2009c. Feature Reinforcement Learning: Part II: Structured MDPs. In progress. Will extend Hutter (2009a).

  • Kaelbling, L. P.; Littman, M. L.; and Cassandra, A. R. 1998. Planning and Acting in Partially Observable Stochastic Domains. Artificial Intelligence 101:99-134.

  • Kearns, M. J., and Singh, S. 1998. Near-optimal reinforcement learning in polynomial time. In Proc. 15th International Conf. on Machine Learning, 260-268. Morgan Kaufmann, San Francisco, CA.

  • Koza, J. R. 1992. Genetic Programming. The MIT Press.

  • Kumar, P. R., and Varaiya, P. P. 1986. Stochastic Systems: Estimation, Identification, and Adaptive Control. Englewood Cliffs, NJ: Prentice Hall.

  • Legg, S., and Hutter, M. 2007. Universal Intelligence: A Definition of Machine Intelligence. Minds & Machines 17(4):391-444.

  • Legg, S. 2008. Machine Super Intelligence. Ph.D. Dissertation, IDSIA, Lugano.

  • Li, M., and Vitányi, P. M. B. 2008. An Introduction to Kolmogorov Complexity and its Applications. Berlin: Springer, 3rd edition.

  • Liang, P., and Jordan, M. 2008. An Asymptotic Analysis of Generative, Discriminative, and Pseudolikelihood Estimators. In Proc. 25th International Conf. on Machine Learning (ICML'08), volume 307, 584-591. ACM.

  • Liu, J. S. 2002. Monte Carlo Strategies in Scientific Computing. Springer.

  • Lusena, C.; Goldsmith, J.; and Mundhenk, M. 2001. Nonapproximability Results for Partially Observable Markov Decision Processes. Journal of Artificial Intelligence Research 14:83-103.

  • MacKay, D. J. C. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge: Cambridge University Press.

  • Madani, O.; Hanks, S.; and Condon, A. 2003. On the Undecidability of Probabilistic Planning and Related Stochastic Optimization Problems. Artificial Intelligence 147:5-34.

  • McCallum, A. K. 1996. Reinforcement Learning with Selective Perception and Hidden State. Ph.D. Dissertation, Department of Computer Science, University of Rochester.

  • Ng, A. Y.; Coates, A.; Diel, M.; Ganapathi, V.; Schulte, J.; Tse, B.; Berger, E.; and Liang, E. 2004. Autonomous Inverted Helicopter Flight via Reinforcement Learning. In ISER, volume 21 of Springer Tracts in Advanced Robotics, 363-372. Springer.

  • Pankov, S. 2008. A Computational Approximation to the AIXI Model. In Proc. 1st Conference on Artificial General Intelligence, volume 171, 256-267.

  • Pearlmutter, B. A. 1989. Learning State Space Trajectories in Recurrent Neural Networks. Neural Computation 1(2):263-269.

  • Poland, J., and Hutter, M. 2006. Universal Learning of Repeated Matrix Games. In Proc. 15th Annual Machine Learning Conf. of Belgium and The Netherlands (Benelearn'06), 7-14.

  • Poupart, P.; Vlassis, N. A.; Hoey, J.; and Regan, K. 2006. An Analytic Solution to Discrete Bayesian Reinforcement Learning. In Proc. 23rd International Conf. on Machine Learning (ICML'06), volume 148, 697-704. Pittsburgh, PA: ACM.

  • Puterman, M. L. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. New York, NY: Wiley.

  • Raedt, L. D.; Hammer, B.; Hitzler, P.; and Maass, W., eds. 2008. Recurrent Neural Networks - Models, Capacities, and Applications, volume 08041 of Dagstuhl Seminar Proceedings. IBFI, Schloss Dagstuhl, Germany.

  • Ring, M. 1994. Continual Learning in Reinforcement Environments. Ph.D. Dissertation, University of Texas, Austin.

  • Ross, S., and Pineau, J. 2008. Model-Based Bayesian Reinforcement Learning in Large Structured Domains. In Proc. 24th Conference in Uncertainty in Artificial Intelligence (UAI'08), 476-483. Helsinki: AUAI Press.

  • Ross, S.; Pineau, J.; Paquet, S.; and Chaib-draa, B. 2008. Online Planning Algorithms for POMDPs. Journal of Artificial Intelligence Research 32:663-704.

  • Russell, S. J., and Norvig, P. 2003. Artificial Intelligence. A Modern Approach. Englewood Cliffs, NJ: Prentice-Hall, 2nd edition.

  • Sanner, S., and Boutilier, C. 2009. Practical Solution Techniques for First-Order MDPs. Artificial Intelligence 173(5-6):748-788.

  • Schmidhuber, J. 2004. Optimal Ordered Problem Solver. Machine Learning 54(3):211-254.

  • Schwarz, G. 1978. Estimating the Dimension of a Model. Annals of Statistics 6(2):461-464.

  • Singh, S.; Littman, M.; Jong, N.; Pardoe, D.; and Stone, P. 2003. Learning Predictive State Representations. In Proc. 20th International Conference on Machine Learning (ICML'03), 712-719.

  • Strehl, A. L.; Diuk, C.; and Littman, M. L. 2007. Efficient Structure Learning in Factored-State MDPs. In Proc. 22nd AAAI Conference on Artificial Intelligence, 645-650. Vancouver, BC: AAAI Press.

  • Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.

  • Szita, I., and Lörincz, A. 2008. The Many Faces of Optimism: a Unifying Approach. In Proc. 25th International Conf. on Machine Learning (ICML'08), volume 307.

  • Wallace, C. S. 2005. Statistical and Inductive Inference by Minimum Message Length. Berlin: Springer.

  • Willems, F. M. J.; Shtarkov, Y. M.; and Tjalkens, T. J. 1997. Reflections on the Prize Paper: The Context-Tree Weighting Method: Basic Properties. IEEE Information Theory Society Newsletter 20-27.

  • Wolpert, D. H., and Macready, W. G. 1997. No Free Lunch Theorems for Optimization. IEEE Transactions on Evolutionary Computation 1(1):67-82.

