XPath/Python - How to get different HTML tags and text inside a <div>


I'm trying to scrape the HTML content at this URL: http://www.dlib.org/dlib/november14/beel/11beel.html. This is my Python code:

    import json

    import requests
    from lxml import html

    s = "http://www.dlib.org/dlib/november14/beel/11beel.html"
    content = requests.get(s)
    tree = html.fromstring(content.text)
    # Only the direct text nodes of the h3 and p elements are selected here
    titoli = tree.xpath('/html/body/form/table[3]/tr/td/table[5]/tr/td/table[1]/tr/td[2]/h3/text()')
    par = tree.xpath('/html/body/form/table[3]/tr/td/table[5]/tr/td/table[1]/tr/td[2]/p/text()')
    articoli = json.dumps({'titoli': titoli, 'contenuti': par})
    print("content-type: json")
    print()
    print(articoli)

What I mainly need is an XPath query that returns every tag, the tags' content, and the text inside the useful div of the page. That section can be found at the path /html/body/form/table[3]/tr/td/table[5], or with the web inspector under the commented line <!-- content table -->. With the code I posted above it is not possible to get the entire content of the div, only the titles and the text inside the p tags of the div, and I can't find a way to get the rest.

To get the actual HTML content of a section of a website using Python/XPath, it is easier to use from lxml import etree instead of from lxml import html. Once you have set up the element tree, there is a function, etree.tostring, that returns the HTML content of an element rather than only its text content (as you mentioned). The code is as follows:

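    from lxml import etree
    import requests

    s = "http://www.dlib.org/dlib/november14/beel/11beel.html"
    page = requests.get(s)
    # etree.HTML (upper case) parses the markup and returns the root <html> element
    tree = etree.HTML(page.text)
    element = tree.xpath('./body/form/table[3]/tr/td/table[5]')
    # Serialize the first (and only) match, including all nested tags
    content = etree.tostring(element[0])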

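tree.xpath returns a list of the selected elements. In this case, because we are using a specific XPath, it returns a list containing one element. Therefore you have to use etree.tostring(element[0]) to access the first element of the list and return the HTML content of that element as a string (a bytes object in Python 3, unless you pass encoding="unicode").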
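To make the list behaviour and the serialization step easier to try without fetching the page, here is a minimal sketch on an inline snippet. The SAMPLE markup, the id="content" attribute, and the names matches, cell, markup, and items are all made up for illustration and are not taken from the real page. The sketch also shows one possible way to collect the h3 titles and the full p text (including text inside nested tags) in document order, which is an assumption about what the question is after, not something the original answer provides.

```python
from lxml import etree

# Hypothetical stand-in for the page's content table; the real page is not fetched.
SAMPLE = """
<html><body>
<table id="content"><tr><td>
<h3>Abstract</h3>
<p>Plain text with <i>nested</i> markup.</p>
<h3>1 Introduction</h3>
<p>Second paragraph.</p>
</td></tr></table>
</body></html>
"""

tree = etree.HTML(SAMPLE)

# xpath() returns a list, even when only one node matches.
matches = tree.xpath('./body/table[@id="content"]')
cell = matches[0]

# encoding="unicode" makes tostring() return str instead of bytes.
markup = etree.tostring(cell, encoding="unicode")

# Collect h3 and p descendants in document order; itertext() also
# yields the text that sits inside nested tags such as <i>.
items = [
    (el.tag, "".join(el.itertext()).strip())
    for el in cell.xpath('.//h3 | .//p')
]

print(markup)
print(items)
```

On the real page, the same pattern should apply with the XPath from the answer in place of the hypothetical id selector; items could then be passed to json.dumps as in the question.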

