XPath/Python - How to get different HTML tags and text inside a <div>


I'm trying to scrape the HTML content at this URL: http://www.dlib.org/dlib/november14/beel/11beel.html. This is my Python code:

    import json

    import requests
    from lxml import html

    s = "http://www.dlib.org/dlib/november14/beel/11beel.html"
    content = requests.get(s)
    tree = html.fromstring(content.text)
    # Section titles and paragraph text inside the content cell
    titoli = tree.xpath('/html/body/form/table[3]/tr/td/table[5]/tr/td/table[1]/tr/td[2]/h3/text()')
    par = tree.xpath('/html/body/form/table[3]/tr/td/table[5]/tr/td/table[1]/tr/td[2]/p/text()')
    articoli = json.dumps({'titoli': titoli, 'contenuti': par})
    print("content-type: json")
    print()
    print(articoli)

What I need is an XPath query that returns every tag, the tags' content, and the text inside the useful div of the page. That div can be found at the path /html/body/form/table[3]/tr/td/table[5], or with the web inspector under the commented line <!-- content table -->. With the code I posted above I can't get the entire content of the div: I only get the titles and the text inside the p elements, and I can't find a way to get the rest.

To get the actual HTML content of a section of the website using Python/XPath, it is easier to use from lxml import etree instead of from lxml import html. Once you have set up the element tree, etree.tostring lets you return the HTML content of an element, rather than only its text content (as you mentioned). The code is as follows:

    from lxml import etree
    import requests

    s = "http://www.dlib.org/dlib/november14/beel/11beel.html"
    page = requests.get(s)
    # etree.HTML parses the markup and returns the root <html> element
    tree = etree.HTML(page.text)
    element = tree.xpath('./body/form/table[3]/tr/td/table[5]')
    # Serialise the first match, tags included
    content = etree.tostring(element[0])

tree.xpath returns a list of the selected elements. In this case, because you are using a specific XPath, it returns a list containing one element. Therefore you have to use etree.tostring(element[0]) to access the first element of the list and return the HTML content of that element as a string (a bytes object by default in Python 3).
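
To see the difference between text() and etree.tostring without hitting the network, here is a minimal sketch on a small inline document. The SAMPLE markup and its XPath are made up for illustration and are not the dlib.org page. It assumes lxml is installed and passes encoding="unicode" so that tostring returns a str rather than bytes.

```python
from lxml import etree

# Hypothetical stand-in for the real page: one table cell with a heading and a paragraph.
SAMPLE = """
<html><body>
  <table><tr><td>
    <h3>Title</h3>
    <p>Some <b>bold</b> text.</p>
  </td></tr></table>
</body></html>
"""

tree = etree.HTML(SAMPLE)

# xpath() always returns a list, even when only one element matches.
cells = tree.xpath('./body/table/tr/td')

# text() only selects text nodes that are direct children of <p>,
# so the text inside the nested <b> is not included.
direct_text = tree.xpath('./body/table/tr/td/p/text()')

# tostring() serialises the element together with all nested tags.
markup = etree.tostring(cells[0], method="html", encoding="unicode")

# itertext() walks every nested text node, if only the plain text is wanted.
all_text = "".join(cells[0].itertext())

print(direct_text)
print(markup)
```

If only the plain text is needed, itertext() is usually enough. tostring() is the one to use when the tags themselves have to be kept.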

