We present a primal sub-gradient method for structured SVM optimization in which the objective is defined as the averaged sum of hinge losses within each example. Whereas the mini-batch version of the Pegasos algorithm for the structured case uses a single structure from each of multiple examples, our algorithm considers multiple structures from a single example in one update. This approach should increase the amount of information learned from each example. We show that the proposed version with the averaged sum loss has at least the same guarantees in terms of the prediction loss as the stochastic version. Experiments are conducted on two sequence labeling problems, shallow parsing and part-of-speech tagging, and include a comparison with other popular sequential structured learning algorithms.
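To make the update concrete, the following is a minimal sketch of one Pegasos-style sub-gradient step in which the hinge losses of several candidate structures from a single example are averaged. It is an illustration under stated assumptions, not the exact algorithm of the paper: the function name `averaged_hinge_step`, the representation of structures as precomputed feature vectors, the normalization by the number of candidates, and the step size 1/(λt) are all assumed here, and how the candidate structures are obtained (for instance by a k-best search) is left open.

```python
import numpy as np

def averaged_hinge_step(w, phi_gold, phi_cands, losses, lam, t):
    """One sub-gradient step on a single example (illustrative sketch).

    w         : (d,) current weight vector
    phi_gold  : (d,) feature vector of the correct structure
    phi_cands : (k, d) feature vectors of k candidate structures (assumed given)
    losses    : (k,) task loss of each candidate against the correct structure
    lam       : regularization constant (lambda > 0)
    t         : iteration counter, starting at 1
    """
    eta = 1.0 / (lam * t)  # assumed Pegasos-style step size
    k = len(losses)

    # Hinge argument per candidate: loss + score(candidate) - score(gold).
    margins = losses + phi_cands @ w - phi_gold @ w
    violated = margins > 0

    # Sub-gradient of (1/k) * sum_j max(0, margin_j); zero if nothing is violated.
    grad = (phi_cands[violated] - phi_gold).sum(axis=0) / k

    # Shrink by the regularizer, then move against the averaged sub-gradient.
    return (1.0 - eta * lam) * w - eta * grad


# Toy usage with two candidates, one of which is the gold structure itself.
w0 = np.zeros(2)
phi_gold = np.array([1.0, 0.0])
phi_cands = np.array([[0.0, 1.0], [1.0, 0.0]])
losses = np.array([1.0, 0.0])
w1 = averaged_hinge_step(w0, phi_gold, phi_cands, losses, lam=0.5, t=1)
```

In this toy call only the first candidate violates its margin, so the step moves the weights toward the gold features and away from that candidate. A single-structure stochastic update corresponds to the special case k = 1; the optional projection step used in some Pegasos variants is omitted for brevity.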