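Earlier articles covered batch recognition of English text in PDFs, see 【python爬虫】批量识别pdf中的英文,自动翻译成中文上 ("[Python crawler] Batch-recognize English in PDFs and automatically translate it into Chinese", Part 1),
and automatic conversion of English PDFs into Chinese documents, see 【python爬虫】批量识别pdf中的英文,自动翻译成中文下 (Part 2 of the same series).
This article uses Python to count the words in a PDF. The count is of words (tokens separated by whitespace and punctuation), not of individual characters.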
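Contents
The PDF to be counted
Extracting the text from the PDF
Defining a function that counts the words on a single page
Counting the words in the whole PDF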
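**1. The PDF to be counted**

The example document is a two-page English PDF (murphy1996.pdf).
Two pages keep the walkthrough simple and clear; the same code works unchanged on English PDFs with any number of pages.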
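**2. Extracting the text from the PDF**

Next, use the pdfplumber library to extract the text from the PDF. The code is as follows:

```python
import pdfplumber as plb

file_path = r'F:\公众号\74_pdf英文翻译\murphy1996.pdf'

# Extract and print the text of every page
with plb.open(file_path) as pdf:
    k = 1  # page counter, starting at 1
    for page in pdf.pages:
        print(' ')
        print('第', k, '页')  # prints "第 k 页", i.e. "Page k"
        print(page.extract_text())
        k += 1
```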
Parameter details:
`file_path`: the path where the PDF file is stored.
`plb.open`: opens the PDF file.
`page.extract_text()`: returns the text content of that page.
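As a hedged variation on the loop above, the sketch below replaces the manual counter with `enumerate` and treats a missing result as an empty string. It assumes that `extract_text()` may return `None` or an empty string for a page without a text layer, depending on the pdfplumber version. `iter_page_texts` is a hypothetical helper name, not part of pdfplumber.

```python
def iter_page_texts(pages):
    """Yield (page_number, text) pairs, numbering pages from 1.

    `pages` is any iterable of objects with an extract_text() method,
    such as `pdf.pages` from pdfplumber. A None result (assumed possible
    for pages without extractable text) is replaced by an empty string.
    """
    for number, page in enumerate(pages, start=1):
        text = page.extract_text()
        yield number, text or ""


# Usage with pdfplumber (file_path as defined above):
#
#     import pdfplumber as plb
#     with plb.open(file_path) as pdf:
#         for number, text in iter_page_texts(pdf.pages):
#             print("Page", number)
#             print(text)
```

Because the helper only relies on an `extract_text()` method, it can be tried out with small stand-in objects before pointing it at a real file.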
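Result (the headers `第 1 页` and `第 2 页` printed by the code mean "Page 1" and "Page 2"; the text is shown exactly as extracted, including recognition errors such as `Pseudornonas`):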
`第 1 页` `Medical and Pediatric Oncology 27:62-63 (1996)` `Ecthyma Gangrenosum Occurring at Sites of Iatrogenic Trauma in Pediatric` `Oncology Patients` `0.M urphy, MB, BCh, BAO, MRCPI, P.J. Marsh, BSC, MB, ChB, MRCPath,` `s.1.` `j. Gray, MB, ChB, MRCP, MARCPath, Pedler, MB, ChB, MRCPath, and` `j. Kernahan, MB, BS, FRcP(Ed) DCH` `We report two cases of ecthyma gan- mary skin lesion. Both required prolonged` `grenosum which occurred at sites of iatro- courses of antibiotics and one patient died.` `genic trauma. The first case developed due The different pathogenic mechanisms and` `to metastatic seeding with Pseudornonas outcomes associated with this condition are` `aeruginosa during an episode of septicaemia discussed. 01996 Wiley-Liss, Inc.` `and the second case occurred as a pri-` `Key words: ecthyma gangrenosum, Pseudomonas aeruginosa, iatrogenic` `INTRODUCTION ate. No further lesions developed during the remainder of` `her treatment.` `Ecthyma gangrenosum (EG) is a well recognized cuta-` `neous manifestation of P.a eruginosa infections in immu- Case 2` `nocompromised patients [ 11. We report two cases of EG` `A 13-month-old girl was admitted for investigation of` `occurring at sites of iatrogenic trauma in pediatric oncol-` `pancytopenia. A diagnosis of aplastic anaemia was made` `ogy patients and demonstrate important pathogenic and` `following left iliac crest marrow aspirate and trephine` `clinical features of this condition.` `bone biopsy. She became pyrexial on day 10 following` `admission but repeated blood cultures were negative. On` `day 24, a 1 cm2 sloughing necrotic area surrounded by` `CASE REPORTS purplish erythema was noted at the bone marrow Sam-` `pling site. At this time her Hb was 6.6 g/dl and WCC was` `Case 1` `2.4 X 109/L( neutrophils 0.6 X 109/L). She was treated` `A 2-year-old girl with acute lymphoblastic leukaemia` `empirically with azlocillin and gentamicin. P. 
aerugi-` `was admitted with a fever, 2 weeks after a course of` `nma was isolated from the lesion swab and a diagnosis of` `chemotherapy which included intrathecal methotrexate.` `EG was made. Blood cultures remained sterile and radio-` `She was profoundly neutropenic (WCC 2.2 X lo9 /L, no logical examination did not reveal any evidence of bony` `neutrophils). Physical, examination revealed a swollen,` `involvement. Despite prolonged antibiotic and topical` `erythematous area with a central black eschar over the` `therapy, the iliac crest lesion failed to improve. On day` `lumbar puncture site. She was commenced empirically` `32, she became pyrexial and Enterobacter sp. was iso-` `on imipenem-cilastatin and teicoplanin. Following isola-` `lated from two blood cultures. She was treated with intra-` `tion of P. aeruginosa from both blood cultures and lesion` `venous gentamicin and ciprofloxacin. Throughout her` `swab, a diagnosis of EG was made and therapy was` `illness she required numerous transfusions with platelets` `changed to ceftazidime and amikacin. Radiological as-` `and red blood cells. A suitable bone marrow donor could` `sessment of the lumbar spine did not reveal any evidence` `of bony involvement. She became apyrexial on day 3 as` `her neutropenia began to recover. She did not require` `From the Departments of Microbiology (O.M., P.J.M., J.G., S .J.P.),` `treatment with colony stimulating factors. Antimicrobials` `and Child Health (J.K.), Royal Victoria Infirmary, Newcastle upon` `were discontinued on day 17. Topical silver sulphadia- Tyne, UK.` `zine was continued for a further 4 weeks as the lesion` `Received April 6, 1995; accepted August 21, 1995` `healed slowly by granulation from the base.` `Address reprint requests to 0. 
Murphy, M.B., B.Ch., B.A.O.,` `For subsequent chemotherapy, high dose intravenous` `M.R.C.P.I., Department of Microbiology, Royal Victoria Infirmary,` `methotrexate was substituted for intrathecal methotrex- Queen Victoria Road, Newcastle upon Tyne NEl 4LP, UK.` `0 1996 Wiley-Liss, Inc.` `第 2 页` `EG at Sites of Iatrogenic Trauma 63` `not be found. A 2-week course of GMCSF was started on In case 1, we believe that seeding to an area of trauma-` `day 53 but no improvement in her haematological param- tised skin occurred during bacteraemia. Early recognition` `eters was seen and her general condition continued to and aggressive treatment may have played a role in con-` `deteriorate. On day 85, she again became pyrexial and a trolling the primary septicaemia but recovery of the pa-` `1. O X 1.5 cm ulcer on her right labium majus was noted. tient’s bone marrow probably contributed more to the` `Her WCC was 0.4 X 109/L. P. aeruginosa was isolated long-term outcome. In case 2, repeated negative blood` `from blood cultures for the first time. Despite aggressive cultures suggest that EG occurred as a primary lesion at a` `antibiotic and antifungal treatment, further lesions devel- site of prior skin trauma. Despite aggressive treatment,` `oped on her face and chest and she subsequently died. persistent profound neutropenia was associated with fail-` `ure of the lesion to resolve and the development of a` `secondary bacteraemia and further lesions.` `DISCUSSION` `Paediatric oncology patients are frequently subject to` `Although not pathognomic, ecthyma gangrenosum is a invasive procedures involving minor skin trauma which` `well recognised manifestation of P. aeruginosa infection may predispose them to infection with various organisms` `in immunocompromised patients. Factors such as neutro- including P. aeruginosa. 
EG is an extremely difficult` `penia, use of bread spectrum antibiotics, loss of skin condition to treat and a high index of suspicion in this` `integrity, and moist conditions have been shown to pre- at-risk population is required to ensure early diagnosis` `dispose to infection with P. aeruginosa and the develop- and optimum treatment.` `ment of EG [2]. Two possible pathogenic mechanisms in` `the development of this condition have been postulated` `[2,3]. In classic or bacteraemic EG, the lesion is consid-` `ered to represent blood-borne metastatic seeding of P.` `aeruginosa to the skin. In non-bacteraemic or primary REFERENCES` `EG, the lesion is located at the site of entry of the organ-` `1. Dorff GJ, Geimer NF, Rosenthal DR, et al.: Pseudornonas septice-` `ism into the skin. In these cases the lesions have been` `mia: illustrated evolution of its skin lesions. Arch Intern Med 128:` `found to occur more commonly in the distribution of 591, 1971.` `exocrine glands and secondary bacteraemia has rarely 2 El Baze P, Thyss A, Vinti H, Deville A, Dellamonica P, Ortonne` `been reported. Early diagnosis and aggressive therapy are J-P: A study of nineteen immunocompromised patients with exten-` `sive skin lesions caused by Pseudomonas aeruginosa with and` `important in the management of these patients. Although` `without bacteraemia. Acta Derm Venereol (Stockh) 71:411-415,` `patients with non-bacteraemic lesions have generally` `1991.` `been found to have a better prognosis than those with 3. Huminer D, Siegman-Igra Y, Morduchowicz G, Pitlik SD: Ec-` `bacteraemic EG [3,4], our experience of survival ulti- thyma gangrenosum without bacteraemia. Report of six cases and a` `mately being determined by recovery of neutrophils con- review of the literature. Arch Intern Med 147:299-301, 1987.` `4. Fergie JE, Patrick CP, Lott L: Pseudomonas aeruginosa cellulitis` `firms that of others [5].` `and ecthyma gangrenosum in imrnunocompromised children. 
Pedi-` `To our knowledge, these are the first reports of EG atr Infect Dis J 10:496-500, 1991.` `occurring at sites of iatrogenic trauma in paediatric oncol- 5. Greene SL, Daniel Su WP, Muller SA: Ecthyma gangrenosurn:` `ogy patients. The only previous report was in an adult report of clinical, histopathologic, and bacteriologic aspects of` `with AML who developed EG at the site of placement of eight cases. J Am Acad Dermatol 11:781-787, 1984.` `6. Klepflish A, Bembi A. Ecthyma gangrenosum caused by a roving` `an ECG electrode [6]. In this case, skin trauma coincided` `chest electrode in an acute myeloid leukaemia patient with` `with a documented P. aeruginosa septicaemia and meta-` `Pseudomonas septicaemia [Letterl. J Am Acad Dermatol 18585-` `static seeding was felt to have occurred. 586, 1988.`
**3. Defining a function that counts the words on a single page**

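Use a regular expression to split the text of a single page into a list, drop the empty strings with `filter`, and then count the items that remain.
The code is as follows:

```python
import re

def wd_num(pg):
    # Split the page text on the delimiter characters, drop empty strings,
    # and return the number of remaining tokens (words).
    pg_wd_num = len(list(filter(None, re.split(r"[\n|\s|,|!|.|(|)|;|-|/|:]", pg))))
    return pg_wd_num
```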
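Parameter details:
`re.split(r"[\n|\s|,|!|.|(|)|;|-|/|:]", pg)`: splits `pg` into a list, using whitespace (including newlines), commas, periods, exclamation marks, parentheses, semicolons, slashes, and colons as delimiters. Inside a character class `|` is a literal character rather than "or", so the pipe is also a delimiter. Because `|-|` is read as a range, the hyphen is not a delimiter, so `62-63` stays one token.
`filter(None, ...)`: removes the empty strings from the list.
`len`: returns the length of the list.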
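The behaviour of the character class can be checked directly. The sketch below compares the article's pattern with a tidied variant. `ORIGINAL`, `TIDIED`, and `tokens` are illustrative names, and the tidied pattern is only an assumption about the intended delimiter set: it treats the hyphen as a delimiter and no longer splits on `|`, so it can produce different counts from the ones reported in this article.

```python
import re

# The pattern used in wd_num: inside [...] the "|" is a literal pipe,
# and "|-|" is parsed as a one-character range, so "-" is not included.
ORIGINAL = r"[\n|\s|,|!|.|(|)|;|-|/|:]"

# Tidied variant (an assumption about the intent): no pipes, hyphen placed
# last so it is literal, and "+" so a run of delimiters splits only once.
TIDIED = r"[\s,!.();/:-]+"


def tokens(pattern, text):
    """Split text on the pattern and drop empty strings."""
    return list(filter(None, re.split(pattern, text)))


sample = "27:62-63 (1996)"
print(tokens(ORIGINAL, sample))  # ['27', '62-63', '1996']
print(tokens(TIDIED, sample))    # ['27', '62', '63', '1996']
print(tokens(ORIGINAL, "a|b"))   # ['a', 'b']
print(tokens(TIDIED, "a|b"))     # ['a|b']
```

Whether a page range such as `62-63` should count as one word or two is a judgment call. The point is only that the pattern should say explicitly which one is meant.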
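To make this easier to follow, the single-page word count is built up layer by layer, from the innermost call outward.
First, the `re.split` call:

```python
pg = '''Ecthyma Gangrenosum Occurring at Sites of Iatrogenic Trauma in Pediatric
    Oncology Patients'''
re.split(r"[\n|\s|,|!|.]", pg)
```

Result:

```
['Ecthyma',
 'Gangrenosum',
 'Occurring',
 'at',
 'Sites',
 'of',
 'Iatrogenic',
 'Trauma',
 'in',
 'Pediatric',
 '',
 '',
 '',
 '',
 'Oncology',
 'Patients']
```

The function splits the string at the specified delimiters and returns a list. The four empty strings come from the run of consecutive delimiters (the line break plus the indentation of the second line): `re.split` emits an empty string between every pair of adjacent delimiters.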
Next, filter the empty strings out of the list:

```python
list(filter(None, re.split(r"[\n|\s|,|!|.|(|)|;|-|/|:]", pg)))
```

Result:

```
['Ecthyma',
 'Gangrenosum',
 'Occurring',
 'at',
 'Sites',
 'of',
 'Iatrogenic',
 'Trauma',
 'in',
 'Pediatric',
 'Oncology',
 'Patients']
```

Finally, take the length of this list, which is the number of words in the string:

```python
len(list(filter(None, re.split(r"[\n|\s|,|!|.|(|)|;|-|/|:]", pg))))
```

Result:

```
12
```

Counting the words in the sample by hand gives the same result, 12.
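An alternative to split-then-filter is to match the words themselves. The sketch below is a different technique from the one used in this article: it assumes a word is a run of ASCII letters or digits, optionally joined by internal hyphens or straight apostrophes, and `count_words` is a hypothetical helper name. On the sample above it gives the same answer, but on full pages it can differ from `wd_num`, for example because stray symbols are not counted as words.

```python
import re

# A word: letters/digits, optionally continued by "-" or "'" plus more
# letters/digits (so "2-year-old" and "patient's" count as one word each).
WORD = re.compile(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*")


def count_words(text):
    """Count word-like tokens by matching them instead of splitting."""
    return len(WORD.findall(text))


pg = '''Ecthyma Gangrenosum Occurring at Sites of Iatrogenic Trauma in Pediatric
    Oncology Patients'''
print(count_words(pg))            # 12
print(count_words("2-year-old"))  # 1
```

Matching has the advantage that the definition of a word is stated positively, instead of being whatever is left over between delimiters.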
**4. Counting the words in the whole PDF**

Finally, loop over the pages to count the words on each page and in the whole PDF:

```python
with plb.open(file_path) as pdf:
    k = 1            # page counter
    sum_wd_num = 0   # running total across all pages
    for page in pdf.pages:
        print(' ')
        pg = page.extract_text()
        sum_wd_num += wd_num(pg)
        print('第', k, '页有', wd_num(pg), '个字符')  # "Page k has N words"
        k += 1
    print(' ')
    print('总计有', sum_wd_num, '个字符')  # "N words in total"
```
Result (the lines read "Page 1 has 611 words", "Page 2 has 674 words", and "1285 words in total"):

```
第 1 页有 611 个字符
第 2 页有 674 个字符
总计有 1285 个字符
```
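To keep the per-page numbers for later use rather than only printing them, the counts can be collected in a list. This is a minimal sketch, assuming the page texts have already been extracted into strings. `page_word_counts` is a hypothetical helper, and `wd_num` is repeated here with the same pattern only so that the block runs on its own.

```python
import re


def wd_num(pg):
    """Word count for one page, using the same pattern as defined earlier."""
    return len(list(filter(None, re.split(r"[\n|\s|,|!|.|(|)|;|-|/|:]", pg))))


def page_word_counts(page_texts):
    """Return (per-page counts, total) for an iterable of page strings."""
    # `text or ""` guards against a None entry for a page with no text.
    counts = [wd_num(text or "") for text in page_texts]
    return counts, sum(counts)


# Two short stand-in "pages"; with pdfplumber the strings would come from
# page.extract_text() for each page in pdf.pages.
counts, total = page_word_counts(["We report two cases.", "Case 1\nCase 2"])
print(counts, total)  # [4, 4] 8
```

Returning the list alongside the total makes it easy to write the numbers to a file or compare documents later, without re-reading the PDF.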
That completes the walkthrough of counting the words in a PDF with Python. Readers who need it can work through the code themselves.