Parsing RSS

Discussion in 'Share Experience' started by Support, Sep 22, 2015.

    The purpose of this article is to show the general approach to parsing RSS. As an example, we will use the RSS feed of our forum: http://en.a-parser.com/forum/english-news/index.rss. This preset can be used for other sites, but because feeds follow different standards, it may need some modifications.

    The parsing process is quite simple: it consists mainly of searching for information with regular expressions and writing it to the result file. The preset works as follows:
    • The Net::HTTP parser is used.
    • A proxy does not have to be used.
    • First, all <item>...</item> blocks are parsed.
    • Then the necessary information is parsed from the resulting array. In this example: title, date, URL, and content.
    • The extracted text is cleaned of superfluous elements (in this example, HTML tags and entities, as well as other leftover strings).
    • The result is output in the desired format using Template Toolkit.
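
    For readers who want to see the same steps outside A-Parser, here is a minimal Python sketch. It is not the preset itself: the sample feed, the regular expressions, and the names clean and parse_items are assumptions made for illustration, and a real feed may need different patterns.

```python
import html
import re

# Hypothetical sample feed, used only for illustration.
SAMPLE_FEED = """<rss><channel>
<title>Example feed</title>
<item>
<title>First post</title>
<pubDate>Tue, 22 Sep 2015 10:00:00 +0000</pubDate>
<link>http://example.com/1</link>
<description><![CDATA[<p>Hello &amp; welcome</p>]]></description>
</item>
<item>
<title>Second post</title>
<link>http://example.com/2</link>
</item>
</channel></rss>"""

# Step 1: one pattern that captures the body of every <item>...</item> block.
ITEM_RE = re.compile(r"<item>(.+?)</item>", re.S)

# Step 2: one pattern per field, applied inside each item body.
FIELD_RES = {
    "title": re.compile(r"<title>(.+?)</title>", re.S),
    "date": re.compile(r"<pubDate>(.+?)</pubDate>", re.S),
    "link": re.compile(r"<link>(.+?)</link>", re.S),
    "desc": re.compile(r"<description[^>]*>(.+?)</description>", re.S),
}


def clean(text):
    """Step 3: drop CDATA markers, decode entities, strip HTML tags."""
    text = text.replace("<![CDATA[", "").replace("]]>", "")
    text = html.unescape(text)
    text = re.sub(r"<[^>]+>", "", text)
    return text.strip()


def parse_items(feed):
    """Return one dict per item; a missing field becomes an empty string."""
    records = []
    for body in ITEM_RE.findall(feed):
        record = {}
        for name, pattern in FIELD_RES.items():
            match = pattern.search(body)
            record[name] = clean(match.group(1)) if match else ""
        records.append(record)
    return records


items = parse_items(SAMPLE_FEED)
```

    Because each field is searched only inside its own item body, the channel-level title is never mistaken for an item title.
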
    With this approach, if some parameter is missing from the body of the RSS feed, its value is simply left empty, so the integrity and consistency of the information are not broken (according to the RSS specification, these parameters are optional and may be omitted). The result is a file with the title, date, URL, and cleaned content of each item. The preset for import:
    eyJwcmVzZXQiOiJSU1MiLCJ2YWx1ZSI6eyJwcmVzZXQiOiJSU1MiLCJwYXJzZXJz IjpbWyJOZXQ6OkhUVFAiLCJkZWZhdWx0Iix7InR5cGUiOiJjdXN0b21SZXN1bHQi LCJyZXN1bHQiOiJkYXRhIiwicmVnZXgiOiI8aXRlbT4oLis/KTxcXC9pdGVtPiIs InJlZ2V4VHlwZSI6InNnIiwicmVzdWx0VHlwZSI6ImFycmF5IiwiYXJyYXlOYW1l IjoiaXRlbXMiLCJyZXN1bHRzIjpbIml0ZW0iXX0seyJ0eXBlIjoib3ZlcnJpZGUi LCJpZCI6InVzZXByb3h5IiwidmFsdWUiOmZhbHNlfSx7InR5cGUiOiJjdXN0b21S ZXN1bHQiLCJyZXN1bHQiOlsiaXRlbXMiLCJpdGVtIl0sInJlZ2V4IjoiPHRpdGxl PiguKz8pPFxcL3RpdGxlPiIsInJlZ2V4VHlwZSI6InMiLCJyZXN1bHRUeXBlIjoi YXJyYXkiLCJhcnJheU5hbWUiOiJ0aXRsZXMiLCJyZXN1bHRzIjpbInRpdGxlIl19 LHsidHlwZSI6ImN1c3RvbVJlc3VsdCIsInJlc3VsdCI6WyJpdGVtcyIsIml0ZW0i XSwicmVnZXgiOiI8cHViRGF0ZT4oLis/KTxcXC9wdWJEYXRlPiIsInJlZ2V4VHlw ZSI6InMiLCJyZXN1bHRUeXBlIjoiYXJyYXkiLCJhcnJheU5hbWUiOiJkYXRlcyIs InJlc3VsdHMiOlsiZGF0ZSJdfSx7InR5cGUiOiJjdXN0b21SZXN1bHQiLCJyZXN1 bHQiOlsiaXRlbXMiLCJpdGVtIl0sInJlZ2V4IjoiPGxpbms+KC4rPyk8XFwvbGlu az4iLCJyZWdleFR5cGUiOiJzIiwicmVzdWx0VHlwZSI6ImFycmF5IiwiYXJyYXlO YW1lIjoibGlua3MiLCJyZXN1bHRzIjpbImxpbmsiXX0seyJ0eXBlIjoiY3VzdG9t UmVzdWx0IiwicmVzdWx0IjpbIml0ZW1zIiwiaXRlbSJdLCJyZWdleCI6Iig/Ojxj b250ZW50fDxkZXNjcmlwdGlvbikuKz8oLis/KSg/OjxcXC9jb250ZW50fDxcXC9k ZXNjcmlwdGlvbikiLCJyZWdleFR5cGUiOiJzIiwicmVzdWx0VHlwZSI6ImFycmF5 IiwiYXJyYXlOYW1lIjoiZGVzY3MiLCJyZXN1bHRzIjpbImRlc2MiXX0seyJ0eXBl Ijoib3ZlcnJpZGUiLCJpZCI6ImRldGVjdGNoYXJzZXQiLCJ2YWx1ZSI6dHJ1ZX0s eyJ0eXBlIjoib3ZlcnJpZGUiLCJpZCI6ImZvcm1hdHJlc3VsdCIsInZhbHVlIjoi JHF1ZXJ5XFxuXFxuXG5bJSBpID0gMDtcbldISUxFIGkgPCBpdGVtcy5zaXplO1xu dGl0bGVzLiRpLnRpdGxlIF9cIiAtIFwiIF8gZGF0ZXMuJGkuZGF0ZSBfXCJcXG5c IjtcbmxpbmtzLiRpLmxpbmsgX1wiXFxuXCI7XG5kZXNjcy4kaS5kZXNjIF9cIlxc bioqKioqKioqKipcXG5cIjtcbmkgPSBpICsgMTtcbkVORCAlXVxuIn1dXSwicmVz dWx0c0Zvcm1hdCI6IiRwMS5wcmVzZXQiLCJyZXN1bHRzU2F2ZVRvIjoiZmlsZSIs InJlc3VsdHNGaWxlTmFtZSI6IiRkYXRlZmlsZS5mb3JtYXQoKS50eHQiLCJhZGRp dGlvbmFsRm9ybWF0cyI6W10sInJlc3VsdHNVbmlxdWUiOiJubyIsInF1ZXJ5Rm9y bWF0IjpbIiRxdWVyeSJdLCJ1bmlxdWVRdWVyaWVzIjpmYWxzZSwic2F2ZUZhaWxl 
ZFF1ZXJpZXMiOmZhbHNlLCJpdGVyYXRvck9wdGlvbnMiOnsib25BbGxMZXZlbHMi OmZhbHNlLCJxdWVyeUJ1aWxkZXJzQWZ0ZXJJdGVyYXRvciI6ZmFsc2V9LCJyZXN1 bHRzT3B0aW9ucyI6eyJvdmVyd3JpdGUiOmZhbHNlfSwiZG9Mb2ciOiJubyIsImtl ZXBVbmlxdWUiOiJObyIsIm1vcmVPcHRpb25zIjpmYWxzZSwicmVzdWx0c1ByZXBl bmQiOiIiLCJyZXN1bHRzQXBwZW5kIjoiIiwicXVlcnlCdWlsZGVycyI6W10sInJl c3VsdHNCdWlsZGVycyI6W3sic291cmNlIjpbMCxbImRlc2NzIiwiZGVzYyJdXSwi dHlwZSI6ImRlY29kZUh0bWwiLCJhcnJheSI6ImRlc2NzIiwidG8iOiJkZXNjIn0s eyJzb3VyY2UiOlswLFsiZGVzY3MiLCJkZXNjIl1dLCJ0eXBlIjoic3RyaW5nUmVw bGFjZSIsImFycmF5IjoiZGVzY3MiLCJzZWFyY2giOiJlbmNvZGVkPjwhW0NEQVRB WyIsInJlcGxhY2UiOiIiLCJ0byI6ImRlc2MifSx7InNvdXJjZSI6WzAsWyJkZXNj cyIsImRlc2MiXV0sInR5cGUiOiJzdHJpbmdSZXBsYWNlIiwiYXJyYXkiOiJkZXNj cyIsInNlYXJjaCI6Il1dPiIsInJlcGxhY2UiOiIiLCJ0byI6ImRlc2MifSx7InNv dXJjZSI6WzAsWyJkZXNjcyIsImRlc2MiXV0sInR5cGUiOiJyZW1vdmVIdG1sIiwi YXJyYXkiOiJkZXNjcyIsInRvIjoiZGVzYyJ9LHsic291cmNlIjpbMCxbImRlc2Nz IiwiZGVzYyJdXSwidHlwZSI6InN0cmluZ1JlcGxhY2UiLCJhcnJheSI6ImRlc2Nz Iiwic2VhcmNoIjoiPCFbQ0RBVEFbIiwicmVwbGFjZSI6IiIsInRvIjoiZGVzYyJ9 XSwiY29uZmlnT3ZlcnJpZGVzIjpbXX19
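
    The point made earlier about missing parameters can be illustrated with a small Python sketch. The feed fragment, the helper first_or_empty, and the output layout in render are all assumptions chosen for the example; they only show why searching item by item keeps the fields aligned.

```python
import re

# Hypothetical feed fragment: the second item has no <pubDate>.
FEED = """<item><title>A</title><pubDate>Mon</pubDate></item>
<item><title>B</title></item>
<item><title>C</title><pubDate>Wed</pubDate></item>"""


def first_or_empty(pattern, text):
    """Return the first captured group, or an empty string if absent."""
    match = re.search(pattern, text, re.S)
    return match.group(1) if match else ""


# Searching the whole feed at once yields two dates for three items,
# so it is no longer clear which date belongs to which title.
all_dates = re.findall(r"<pubDate>(.+?)</pubDate>", FEED, re.S)

# Searching item by item keeps exactly one slot per item in every list.
items = re.findall(r"<item>(.+?)</item>", FEED, re.S)
titles = [first_or_empty(r"<title>(.+?)</title>", item) for item in items]
dates = [first_or_empty(r"<pubDate>(.+?)</pubDate>", item) for item in items]


def render(titles, dates):
    """Join the parallel lists by index (assumed layout, for illustration)."""
    return "".join(f"{t} - {d}\n**********\n" for t, d in zip(titles, dates))


output = render(titles, dates)
```

    Here the dates list has an empty string in the second position, so each title is still paired with its own date when the lists are walked by index.
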
     
