Counting English Words in a PDF with Python


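A previous article covered batch recognition of the English text in PDFs; see "[Python crawler] Batch-recognize English in PDFs and automatically translate it into Chinese (Part 1)".

Automatic conversion of an English PDF into a Chinese document is covered in "[Python crawler] Batch-recognize English in PDFs and automatically translate it into Chinese (Part 2)".

This article uses Python to count the words in a PDF.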

Contents

  1. The PDF to be counted

  2. Extracting the text of the PDF

  3. Defining a function that counts the words on a single page

  4. Counting the words in the PDF

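1. The PDF to be counted

First, a look at the PDF whose words are to be counted.

To keep things simple and clear, this article uses a two-page English PDF as the example. The code applies directly to English PDFs with any number of pages.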

2. Extracting the text of the PDF

Next, use the pdfplumber library to extract the text of the PDF. The code is as follows:


 
 

```python
import pdfplumber as plb

file_path = r'F:\公众号\74_pdf英文翻译\murphy1996.pdf'

# Extract and print the text of every page
with plb.open(file_path) as pdf:
    for k, page in enumerate(pdf.pages, start=1):
        print(' ')
        print('Page', k)
        print(page.extract_text())
```

Explanation of the code:

file_path: the path where the PDF file is stored.

plb.open: opens the PDF file.

page.extract_text(): returns the text content of that page.

Result:

```
Page 1
Medical and Pediatric Oncology 27:62-63 (1996)
Ecthyma Gangrenosum Occurring at Sites of Iatrogenic Trauma in Pediatric
Oncology Patients
0.M urphy, MB, BCh, BAO, MRCPI, P.J. Marsh, BSC, MB, ChB, MRCPath,
s.1.
j. Gray, MB, ChB, MRCP, MARCPath, Pedler, MB, ChB, MRCPath, and
j. Kernahan, MB, BS, FRcP(Ed) DCH
We report two cases of ecthyma gan- mary skin lesion. Both required prolonged
grenosum which occurred at sites of iatro- courses of antibiotics and one patient died.
genic trauma. The first case developed due The different pathogenic mechanisms and
to metastatic seeding with Pseudornonas outcomes associated with this condition are
aeruginosa during an episode of septicaemia discussed. 01996 Wiley-Liss, Inc.
and the second case occurred as a pri-
Key words: ecthyma gangrenosum, Pseudomonas aeruginosa, iatrogenic
INTRODUCTION ate. No further lesions developed during the remainder of
her treatment.
Ecthyma gangrenosum (EG) is a well recognized cuta-
neous manifestation of P.a eruginosa infections in immu- Case 2
nocompromised patients [ 11. We report two cases of EG
A 13-month-old girl was admitted for investigation of
occurring at sites of iatrogenic trauma in pediatric oncol-
pancytopenia. A diagnosis of aplastic anaemia was made
ogy patients and demonstrate important pathogenic and
following left iliac crest marrow aspirate and trephine
clinical features of this condition.
bone biopsy. She became pyrexial on day 10 following
admission but repeated blood cultures were negative. On
day 24, a 1 cm2 sloughing necrotic area surrounded by
CASE REPORTS purplish erythema was noted at the bone marrow Sam-
pling site. At this time her Hb was 6.6 g/dl and WCC was
Case 1
2.4 X 109/L( neutrophils 0.6 X 109/L). She was treated
A 2-year-old girl with acute lymphoblastic leukaemia
empirically with azlocillin and gentamicin. P. aerugi-
was admitted with a fever, 2 weeks after a course of
nma was isolated from the lesion swab and a diagnosis of
chemotherapy which included intrathecal methotrexate.
EG was made. Blood cultures remained sterile and radio-
She was profoundly neutropenic (WCC 2.2 X lo9 /L, no logical examination did not reveal any evidence of bony
neutrophils). Physical, examination revealed a swollen,
involvement. Despite prolonged antibiotic and topical
erythematous area with a central black eschar over the
therapy, the iliac crest lesion failed to improve. On day
lumbar puncture site. She was commenced empirically
32, she became pyrexial and Enterobacter sp. was iso-
on imipenem-cilastatin and teicoplanin. Following isola-
lated from two blood cultures. She was treated with intra-
tion of P. aeruginosa from both blood cultures and lesion
venous gentamicin and ciprofloxacin. Throughout her
swab, a diagnosis of EG was made and therapy was
illness she required numerous transfusions with platelets
changed to ceftazidime and amikacin. Radiological as-
and red blood cells. A suitable bone marrow donor could
sessment of the lumbar spine did not reveal any evidence
of bony involvement. She became apyrexial on day 3 as
her neutropenia began to recover. She did not require
From the Departments of Microbiology (O.M., P.J.M., J.G., S .J.P.),
treatment with colony stimulating factors. Antimicrobials
and Child Health (J.K.), Royal Victoria Infirmary, Newcastle upon
were discontinued on day 17. Topical silver sulphadia- Tyne, UK.
zine was continued for a further 4 weeks as the lesion
Received April 6, 1995; accepted August 21, 1995
healed slowly by granulation from the base.
Address reprint requests to 0. Murphy, M.B., B.Ch., B.A.O.,
For subsequent chemotherapy, high dose intravenous
M.R.C.P.I., Department of Microbiology, Royal Victoria Infirmary,
methotrexate was substituted for intrathecal methotrex- Queen Victoria Road, Newcastle upon Tyne NEl 4LP, UK.
0 1996 Wiley-Liss, Inc.
Page 2
EG at Sites of Iatrogenic Trauma 63
not be found. A 2-week course of GMCSF was started on In case 1, we believe that seeding to an area of trauma-
day 53 but no improvement in her haematological param- tised skin occurred during bacteraemia. Early recognition
eters was seen and her general condition continued to and aggressive treatment may have played a role in con-
deteriorate. On day 85, she again became pyrexial and a trolling the primary septicaemia but recovery of the pa-
1. O X 1.5 cm ulcer on her right labium majus was noted. tient’s bone marrow probably contributed more to the
Her WCC was 0.4 X 109/L. P. aeruginosa was isolated long-term outcome. In case 2, repeated negative blood
from blood cultures for the first time. Despite aggressive cultures suggest that EG occurred as a primary lesion at a
antibiotic and antifungal treatment, further lesions devel- site of prior skin trauma. Despite aggressive treatment,
oped on her face and chest and she subsequently died. persistent profound neutropenia was associated with fail-
ure of the lesion to resolve and the development of a
secondary bacteraemia and further lesions.
DISCUSSION
Paediatric oncology patients are frequently subject to
Although not pathognomic, ecthyma gangrenosum is a invasive procedures involving minor skin trauma which
well recognised manifestation of P. aeruginosa infection may predispose them to infection with various organisms
in immunocompromised patients. Factors such as neutro- including P. aeruginosa. EG is an extremely difficult
penia, use of bread spectrum antibiotics, loss of skin condition to treat and a high index of suspicion in this
integrity, and moist conditions have been shown to pre- at-risk population is required to ensure early diagnosis
dispose to infection with P. aeruginosa and the develop- and optimum treatment.
ment of EG [2]. Two possible pathogenic mechanisms in
the development of this condition have been postulated
[2,3]. In classic or bacteraemic EG, the lesion is consid-
ered to represent blood-borne metastatic seeding of P.
aeruginosa to the skin. In non-bacteraemic or primary REFERENCES
EG, the lesion is located at the site of entry of the organ-
1. Dorff GJ, Geimer NF, Rosenthal DR, et al.: Pseudornonas septice-
ism into the skin. In these cases the lesions have been
mia: illustrated evolution of its skin lesions. Arch Intern Med 128:
found to occur more commonly in the distribution of 591, 1971.
exocrine glands and secondary bacteraemia has rarely 2 El Baze P, Thyss A, Vinti H, Deville A, Dellamonica P, Ortonne
been reported. Early diagnosis and aggressive therapy are J-P: A study of nineteen immunocompromised patients with exten-
sive skin lesions caused by Pseudomonas aeruginosa with and
important in the management of these patients. Although
without bacteraemia. Acta Derm Venereol (Stockh) 71:411-415,
patients with non-bacteraemic lesions have generally
1991.
been found to have a better prognosis than those with 3. Huminer D, Siegman-Igra Y, Morduchowicz G, Pitlik SD: Ec-
bacteraemic EG [3,4], our experience of survival ulti- thyma gangrenosum without bacteraemia. Report of six cases and a
mately being determined by recovery of neutrophils con- review of the literature. Arch Intern Med 147:299-301, 1987.
4. Fergie JE, Patrick CP, Lott L: Pseudomonas aeruginosa cellulitis
firms that of others [5].
and ecthyma gangrenosum in imrnunocompromised children. Pedi-
To our knowledge, these are the first reports of EG atr Infect Dis J 10:496-500, 1991.
occurring at sites of iatrogenic trauma in paediatric oncol- 5. Greene SL, Daniel Su WP, Muller SA: Ecthyma gangrenosurn:
ogy patients. The only previous report was in an adult report of clinical, histopathologic, and bacteriologic aspects of
with AML who developed EG at the site of placement of eight cases. J Am Acad Dermatol 11:781-787, 1984.
6. Klepflish A, Bembi A. Ecthyma gangrenosum caused by a roving
an ECG electrode [6]. In this case, skin trauma coincided
chest electrode in an acute myeloid leukaemia patient with
with a documented P. aeruginosa septicaemia and meta-
Pseudomonas septicaemia [Letterl. J Am Acad Dermatol 18585-
static seeding was felt to have occurred. 586, 1988.
```
   


  


  



  
 **3. Defining a function that counts the words on a single page** 

 Use a regular expression to split the text of a page into a list, use the filter function to drop the empty values, and then count the words on that page.
 The code is as follows:


 
 

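```python
import re


def wd_num(pg):
    """Return the number of words in the text of one page."""
    # Split on the delimiter characters, then drop the empty strings
    # produced wherever two delimiters are adjacent.
    words = list(filter(None, re.split(r"[\n|\s|,|!|.|(|)|;|-|/|:]", pg)))
    return len(words)
```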
 

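Explanation of the code:

re.split(r"[\n|\s|,|!|.|(|)|;|-|/|:]", pg): splits the content of pg into a list, using spaces, line breaks, commas, periods, exclamation marks and so on as delimiters. Inside the square brackets `|` is an ordinary character rather than an "or" operator, so `|` itself also acts as a delimiter, and `|-|` is read as a range from `|` to `|`, which means the hyphen is not a delimiter.

filter(None, ...): removes the empty strings from the list.

len: returns the length of the list.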
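The character class in this pattern behaves a little differently from what its layout suggests, and a small experiment makes that visible. The sketch below reuses the pattern verbatim; the helper name `split_words` and the sample strings are made up for illustration.

```python
import re

# The delimiter pattern exactly as used in wd_num.
PATTERN = r"[\n|\s|,|!|.|(|)|;|-|/|:]"


def split_words(text):
    """Split text on PATTERN and drop the empty strings (illustrative helper)."""
    return list(filter(None, re.split(PATTERN, text)))


# "|" is a literal character inside [...], so it acts as a delimiter too.
print(split_words("alpha|beta"))         # ['alpha', 'beta']

# "|-|" is parsed as a range from "|" to "|", so "-" is not a delimiter.
print(split_words("13-month-old girl"))  # ['13-month-old', 'girl']

# A line-break hyphen therefore stays attached to its fragment.
print(split_words("gan-\ngrenosum"))     # ['gan-', 'grenosum']
```

One consequence is that a word hyphenated across a line break, such as "gan-" and "grenosum" in the extracted text, is counted as two words.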

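To make this easier to follow, the single-page word count is built up step by step, from the innermost call outwards.

First comes the call to re.split:

```python
pg = '''Ecthyma Gangrenosum Occurring at Sites of Iatrogenic Trauma in Pediatric
    Oncology Patients'''
re.split(r"[\n|\s|,|!|.]", pg)
```

Result:

```
['Ecthyma',
 'Gangrenosum',
 'Occurring',
 'at',
 'Sites',
 'of',
 'Iatrogenic',
 'Trauma',
 'in',
 'Pediatric',
 '',
 '',
 '',
 '',
 'Oncology',
 'Patients']
```

The function has split the string into a list at the specified delimiters. The four empty strings come from the line break and the spaces that follow one another between 'Pediatric' and 'Oncology'.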
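Empty strings appear because `re.split` cuts at every single delimiter character, so each extra adjacent delimiter leaves one empty string behind. The following sketch uses a simplified delimiter set and a made-up sample string (both are assumptions, not the article's exact inputs) to show this, along with one possible variation that adds a `+` quantifier.

```python
import re

text = "Pediatric\n    Oncology, Patients"

# One delimiter per cut: adjacent delimiters leave empty strings between them.
one_at_a_time = re.split(r"[\s,]", text)
print(one_at_a_time)
# ['Pediatric', '', '', '', '', 'Oncology', '', 'Patients']

# With "+", a whole run of delimiters counts as a single cut.
runs = re.split(r"[\s,]+", text)
print(runs)
# ['Pediatric', 'Oncology', 'Patients']

# Leading or trailing delimiters still produce empty strings,
# so filtering remains useful even with "+".
padded = re.split(r"[\s,]+", " a b ")
print(padded)
# ['', 'a', 'b', '']
```

Either way the final count is the same once the empty strings are filtered out, so the choice is mainly a matter of readability.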
 
 
 Next, filter out the empty values in the list:

```python
list(filter(None, re.split(r"[\n|\s|,|!|.|(|)|;|-|/|:]", pg)))
```

 Result:

```
['Ecthyma',
 'Gangrenosum',
 'Occurring',
 'at',
 'Sites',
 'of',
 'Iatrogenic',
 'Trauma',
 'in',
 'Pediatric',
 'Oncology',
 'Patients']
```
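 Finally, take the length of this list, which is the number of words in the string:

```python
len(list(filter(None, re.split(r"[\n|\s|,|!|.|(|)|;|-|/|:]", pg))))
```

 Result:

```
12
```

 Counting the words by hand gives the same number.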
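As a possible alternative, words can be matched directly instead of splitting on delimiters. The sketch below is not the method used in this article: `WORD_RE` and `count_words_findall` are hypothetical names, and the rule "a word is a run of ASCII letters, optionally joined by hyphens or apostrophes" is an assumption that ignores numbers, so its counts will generally differ from those of `wd_num`.

```python
import re

# Assumed definition of a word: ASCII letters, optionally joined by
# hyphens or apostrophes (e.g. "non-bacteraemic").
WORD_RE = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")


def count_words_findall(text):
    """Count the matches of WORD_RE in text (hypothetical helper)."""
    return len(WORD_RE.findall(text))


def wd_num(pg):
    """The article's split-based count, repeated here for comparison."""
    return len(list(filter(None, re.split(r"[\n|\s|,|!|.|(|)|;|-|/|:]", pg))))


line = "Medical and Pediatric Oncology 27:62-63 (1996)"
print(count_words_findall(line))  # 4: the numeric tokens are ignored
print(wd_num(line))               # 7: "27", "62-63" and "1996" are counted too
```

Which rule is appropriate depends on whether numbers, page ranges and years should count as words for the task at hand.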


 
   

 
 
 **4. Counting the words in the PDF** 
 
 
   
 Finally, loop over the pages to count the words on each page and in the whole PDF:

```python
with plb.open(file_path) as pdf:
    sum_wd_num = 0
    for k, page in enumerate(pdf.pages, start=1):
        pg = page.extract_text()
        n = wd_num(pg)  # count once, reuse for the total and the printout
        sum_wd_num += n
        print(' ')
        print('Page', k, 'has', n, 'words')
    print(' ')
    print('Total:', sum_wd_num, 'words')
```

 Result:

```
Page 1 has 611 words
Page 2 has 674 words
Total: 1285 words
```
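To reuse the counting logic outside the printing loop, one option is a small helper that returns the numbers instead of printing them. This is only a sketch: `count_pages` is a hypothetical name, and it treats `None` as an empty page on the assumption that a page without extractable text may yield `None` or an empty string, depending on the pdfplumber version. The page texts are stand-in strings so that the example runs without a PDF.

```python
import re


def wd_num(pg):
    """Same counting rule as the article's wd_num."""
    return len(list(filter(None, re.split(r"[\n|\s|,|!|.|(|)|;|-|/|:]", pg))))


def count_pages(page_texts):
    """Return (per-page counts, total) for an iterable of page texts.

    None is treated as an empty page (assumption, see the note above).
    """
    counts = [wd_num(text or "") for text in page_texts]
    return counts, sum(counts)


# Stand-in strings; with pdfplumber these would come from
# [page.extract_text() for page in pdf.pages].
pages = ["We report two cases.", None, "Case 1\nA 2-year-old girl"]
counts, total = count_pages(pages)
print(counts, total)  # [4, 0, 5] 9
```

Returning the counts makes it easy to write them to a file or compare several PDFs, while the printing loop stays convenient for a quick look.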
 
 
 That completes the walkthrough of counting the words in a PDF with Python. Readers who need it can follow the code and try it for themselves.
 
 
