面向站長和網站管理員的Web緩存加速指南[翻譯]
原文(英文)地址:
http://www.mnot.net/cache_docs/
版權聲明:
署名-非商業性使用-禁止演繹 2.0
轉載: http://www.chedong.com/tech/cache_docs.html
這是一篇知識性的文檔,主要目的是為了讓Web緩存相關概念更容易被開發者理解并應用于實際的應用環境中。為了簡要起見,某些實現方面的細節被簡化或省略了。如果你更關心細節實現則完全不必耐心看完本文,后面參考文檔和更多深入閱讀部分可能是你更需要的內容。
- 什么是Web緩存,為什么要使用它?
- 緩存的類型:
- 瀏覽器緩存;
- 代理服務器緩存;
- Web緩存無害嗎?為什么要鼓勵緩存?
- Web緩存如何工作:
- 如何控制(以及如何不去控制)緩存:
- HTML Meta標簽 vs. HTTP頭信息;
- Pragma HTTP頭信息(為什么不起作用);
- 使用Expires(過期時間)HTTP頭信息控制保鮮期;
- Cache-Control(緩存控制) HTTP頭信息;
- 校驗參數和校驗;
- 創建利于緩存網站的竅門;
- 編寫利于緩存的腳本;
- 常見問題解答;
- 緩存機制的實現:Web服務器端配置;
- 緩存機制的實現:服務器端腳本;
- 參考文檔和深入閱讀;
- 關于本文檔;
什么是Web緩存,為什么要使用它?
Web緩存位于Web服務器之間(1個或多個,內容源服務器)和客戶端之間(1個或多個):緩存會根據進來的請求保存輸出內容的副本,例如html頁面, 圖片,文件(統稱為副本),然后,當下一個請求來到的時候:如果是相同的URL,緩存直接使用副本響應訪問請求,而不是向源服務器再次發送請求。使用緩存主要有2大理由:
- 減少相應延遲 :因為請求從緩存服務器(離客戶端更近)而不是源服務器被相應,這個過程耗時更少,讓web服務器看上去相應更快;
- 減少網絡帶寬消耗 :當副本被重用時會減低客戶端的帶寬消耗;客戶可以節省帶寬費用,控制帶寬的需求的增長并更易于管理。
緩存的類型
瀏覽器緩存
對于新一代的Web瀏覽器來說(例如:IE,Firefox):一般都能在設置對話框中發現關于緩存的設置,通過在你的電腦上僻處一塊硬盤空間用于存儲你已經看過的網站的副本。瀏覽器緩存根據非常簡單的規則進行工作:在同一個會話過程中(在當前瀏覽器沒有被關閉之前)會檢查一次并確定緩存的副本足夠新。這個緩存對于用戶點擊“后退”或者點擊剛訪問過的鏈接特別有用,如果你瀏覽過程中訪問到同一個圖片,這些圖片可以從瀏覽器緩存中調出而即時顯現。
代理服務器緩存
Web代理緩存使用同樣的原理,只是規模更大:代理服務器以同樣的方式為成百上千的用戶服務;大公司和ISP經常在他們的防火牆上架設代理緩存,或者部署單獨的緩存設備;
由於代理服務器緩存並非客戶端或者源服務器的一部分,而是位於兩者之間的網絡上,請求必須通過某種方式路由到它們才能起作用。一種方法是手工設置你的瀏覽器:告訴瀏覽器使用哪個代理;另外一種是使用攔截代理:由底層網絡本身把Web請求重定向到代理,這樣用戶不必配置代理,甚至不必知道代理的存在;
代理服務器緩存是一種共享緩存:不只為一個用戶服務,而是同時為大量用戶使用,因此在減少響應時間和帶寬消耗方面很有效:因為熱門的副本會被重用很多次。
網關緩存
也被稱為“反向代理緩存”或“替代緩存”(surrogate cache),網關緩存同樣是一種中間服務器;和被網絡管理員部署用於節省帶寬的代理緩存不同,網關緩存一般由網站管理員自己部署,讓他們的網站更容易擴展、更可靠並獲得更好的性能;
請求可以通過多種方法被路由到網關緩存上:典型的做法是使用某種負載均衡設備,讓一臺或多臺網關緩存在客戶端看來就像是源服務器;
內容分發網絡(Content Delivery Networks,CDN)將網關緩存分布到整個(或部分)互聯網上,並把緩存服務出售給有需要的網站,Speedera 和 Akamai 就是典型的CDN服務商(下文簡稱CDN)。
本問主要關注于瀏覽器和代理緩存,當然,有些信息對于網關緩存也同樣有效;
Web緩存無害嗎?為什么要鼓勵緩存?
Web緩存是互聯網上最容易被誤解的技術之一:網站管理員尤其害怕對網站失去控制,因為代理緩存會“隱藏”他們的用戶,讓他們難以了解誰在使用自己的網站。
不幸的是:即使沒有Web緩存,互聯網上也存在太多的變數,讓管理員很難精確地了解用戶如何使用他們的網站;如果這是你非常關心的問題,本文將告訴你如何在不讓網站變得緩存不友好的前提下,獲得你需要的統計數據。
另外一個抱怨是緩存會給用戶過期或失效的數據;無論如何:本文可以告訴你怎樣配置你的服務器來控制你的內容將被如何緩存。
CDN是另外一個有趣的方向,和其他代理緩存不同:CDN的網關緩存為希望被緩存的網站服務,沒有以上顧慮。即使你使用了CDN,你也要考慮后續的代理服務器緩存和瀏覽器緩存問題。
另外一方面:如果良好地規劃了你的網站,緩存會有助于網站服務更快,并節省服務器負載和互聯網的鏈接請求。這個改善是顯著的:一個難以緩存的網站可能需要幾秒去載入頁面,而對比有緩存的網站頁面幾乎是即時顯現:用戶更喜歡速度快的網站并更經常的訪問;
這樣想:很多大型互聯網公司為全世界服務器群投入上百萬資金,為的就是讓用戶訪問盡可能快,客戶端緩存也是這個目的,只不過更靠近用戶一端,而且最好的一點是你甚至根本不用為此付費。
事實上,無論你是否喜歡,代理服務器和瀏覽器都會使用緩存。如果你沒有正確地配置網站的緩存,它們就會按照緩存管理員設定的缺省策略進行緩存。
緩存如何工作
所有的緩存都用一套規則來幫助他們決定什么時候使用緩存中的副本提供服務(假設有副本可用的情況下);一些規則在協議中有定義(HTTP協議1.0和1.1),一些規則由緩存的管理員設置(瀏覽器的用戶或者代理服務器的管理員);
一般說來:遵循以下基本的規則(不必擔心,你不必知道所有的細節,細節將隨后說明)
- 如果響應頭信息:告訴緩存器不要保留緩存,緩存器就不會緩存相應內容;
- 如果請求需要認證或者是安全加密的(如HTTPS),響應內容也不會被緩存;
- 如果響應中既沒有校驗器(ETag或者Last-Modified頭信息),也沒有任何明確的新鮮度信息,通常(但並非總是)會被認為不可緩存;
- 如果緩存的副本滿足以下條件之一,就會被認為是足夠新的(即可以不經源服務器確認直接發送給客戶端):
- 含有完整的過期時間和壽命控制頭信息,并且內容仍在保鮮期內;
- 瀏覽器已經使用過緩存副本,并且在一個會話中已經檢查過內容的新鮮度;
- 緩存近期內已經使用過該副本,並且副本的最後修改時間距今已經比較久遠;
- 夠新的副本將直接從緩存中送出,而不會向源服務器發送請求;
- 如果緩存的副本已經過期了,緩存將向源服務器發出校驗請求,用於確定是否可以繼續用當前副本提供服務;
如果副本足夠新,從緩存中提取就立刻能用了;
而經緩存器校驗后發現副本的原件沒有變化,系統也會避免將副本內容從源服務器整個重新傳輸一遍。
如何控制(以及如何不去控制)緩存
有很多工具可以幫助設計師和網站管理員調整緩存服務器對待網站的方式,這也許需要你親自下手對服務器的配置進行一些調整,但絕對值得;了解如何使用這些工具請參考后面的實現章節;
HTML meta標簽和HTTP 頭信息
HTML的編寫者會在文檔的<HEAD>區域中加入描述文檔的各種屬性,這些META標簽常常被用于標記文檔不可以被緩存或者標記多長時間后過期;
META標籤使用很簡單,但是效果並不好。因為只有少數瀏覽器緩存會遵循這個標記(它們真正會去讀HTML),而幾乎沒有代理緩存會遵循(因為它們幾乎從不解析文檔中的HTML內容);雖然在Web頁面中加入 Pragma: no-cache 這個META標記很有誘惑力,但這並不一定能保證頁面不被緩存或保持最新。
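這類META標記通常形如下面這樣(僅作示意):
<META HTTP-EQUIV="Pragma" CONTENT="no-cache">
<META HTTP-EQUIV="Expires" CONTENT="Mon, 22 Jul 2002 11:12:01 GMT">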
如果你的網站托管在ISP機房中,并且機房可能不給你權限去控制HTTP的頭信息(如:Expires和Cache-Control),大聲控訴:這些機制對于你的工作來說是必須的;
另外一方面: HTTP頭信息可以讓你對瀏覽器和代理服務器如何處理你的副本進行更多的控制。他們在HTML代碼中是看不見的,一般由Web服務器自動生成。但是,根據 你使用的服務,你可以在某種程度上進行控制。在下文中:你將看到一些有趣的HTTP頭信息,和如何在你的站點上應用部署這些特性。
HTTP頭信息發送在HTML代碼之前,只有被瀏覽器和一些中間緩存能看到,一個典型的HTTP 1.1協議返回的頭信息看上去像這樣:
Date: Fri, 30 Oct 1998 13:19:41 GMT
Server: Apache/1.3.3 (Unix)
Cache-Control: max-age=3600, must-revalidate
Expires: Fri, 30 Oct 1998 14:19:41 GMT
Last-Modified: Mon, 29 Jun 1998 02:28:12 GMT
ETag: "3e86-410-3596fbbc"
Content-Length: 1040
Content-Type: text/html
在頭信息空一行后是HTML代碼的輸出,關于如何設置HTTP頭信息請參考實現章節;
Pragma HTTP頭信息 (為什么它不起作用)
很多人認為在HTTP頭信息中設置了 Pragma: no-cache 後會讓內容無法被緩存。但事實並非如此:HTTP的規範中並沒有任何關於響應型Pragma頭信息的說明,它討論的是請求型的Pragma頭信息(即瀏覽器發送給服務器的頭信息);雖然少數緩存可能會遵循這個頭信息,但大部分不會,用了它也不會起什麼作用。要用就使用下面介紹的頭信息:
使用Expires(過期時間)HTTP頭信息來控制保鮮期
Expires(過期時間) 屬性是HTTP控制緩存的基本手段,這個屬性告訴緩存器:相關副本在多長時間內是新鮮的。過了這個時間,緩存器就會向源服務器發送請求,檢查文檔是否被修改。幾乎所有的緩存服務器都支持Expires(過期時間)屬性;
大部分Web服務器允許你用幾種方式設置Expires響應頭信息;通常:可以設置一個絕對的過期時間,或一個基於客戶端最後一次獲取副本的時間(最後訪問時間)的相對時間,或一個基於服務器上文檔最後被修改時間的相對時間;
Expires頭信息對於讓靜態圖片文件(例如導航欄和圖片按鈕)可緩存特別有用;因為這些圖片修改很少,你可以給它們設置一個特別長的過期時間,這會讓你的網站對用戶顯得響應非常快;它們對於控制有規律改變的網頁也很有用,例如:你每天早上6點更新新聞頁,你可以把副本的過期時間也設為這個時間,這樣緩存就知道什麼時候去取一個新版本,而不必讓用戶去按瀏覽器的“刷新”按鈕。
Expires頭信息唯一合法的值是HTTP格式的日期時間,其他的值幾乎都會被解析成“過去的時間”,從而讓副本立即過期。另外記住:HTTP日期時間使用的是格林威治時間(GMT),而不是本地時間。舉例:
Expires: Fri, 30 Oct 1998 14:19:41 GMT
所以使用Expires頭信息時一定要確認你的Web服務器時間設置正確,一個辦法是使用網絡時間協議(Network Time Protocol,NTP)進行同步,你可以向你的系統管理員了解更多細節。
雖然Expires頭信息非常有用,但是它還是有些局限。首先:因為牽扯到日期,Web服務器的時間和緩存的時間必須保持同步;如果兩者的時間不一致,就可能達不到預期的效果,緩存還可能把已經過期的內容誤當作是新鮮的。
Expires還有一個問題:很容易忘記你曾把某些內容的過期時間設成了一個固定的日期;如果在它過期之前你沒有更新這個時間,那麼過期之後的每一個請求都會回到源Web服務器,反而增加了負載和響應時間;
Cache-Control(緩存控制) HTTP頭信息
HTTP 1.1引入了另外一組頭信息:Cache-Control響應頭信息,讓網站的發布者可以更全面地控制他們的內容,並解決Expires頭信息的局限。
有用的 Cache-Control響應頭信息包括:
- max-age=[秒] — 指定副本被認為是新鮮的最長時間。類似於Expires,這個指令是基於請求時間的相對時間間隔,而不是絕對過期時間;[秒]是一個數字,表示你希望副本從請求時間開始保持新鮮的秒數。
- s-maxage =[秒] — 類似于max-age屬性,除了他應用于共享(如:代理服務器)緩存
- public — 標記認證內容也可以被緩存,一般來說: 經過HTTP認證才能訪問的內容,輸出是自動不可以緩存的;
- no-cache — 強制緩存每次在提供緩存副本之前,都必須先把請求提交給源服務器進行校驗。這對於需要確保認證被遵守的場合很有用(可以和public結合使用),或者用於嚴格要求內容新鮮的場合,而不必犧牲緩存的所有好處;
- no-store — 強制緩存在任何情況下都不要保留任何副本
- must-revalidate — 告訴緩存必須嚴格遵循你為副本提供的所有新鮮度信息。HTTP允許緩存在某些特殊情況下返回過期的副本;指定了這個屬性,就等于告訴緩存:你希望它嚴格遵循你的規則。
- proxy-revalidate — 和 must-revalidate類似,除了他只對緩存代理服務器起作用
舉例:
Cache-Control: max-age=3600, must-revalidate
如果你計劃使用Cache-Control頭信息,你應該看一下HTTP 1.1中出色的相關文檔,詳見參考文檔和深入閱讀;
校驗參數和校驗
在“Web緩存如何工作”一節中我們說過:校驗是服務器與緩存之間用來溝通副本是否已被修改的機制;利用校驗,緩存可以在副本實際上仍然足夠新的情況下,避免重新下載整個副本。
校驗器非常重要:如果副本既沒有校驗器,也沒有任何新鮮度信息(Expires或Cache-Control),緩存將根本不會存儲這個副本;
最常見的校驗器是文檔的最後修改時間,通過Last-Modified頭信息傳遞。當緩存保存的副本包含Last-Modified信息時,它可以據此通過If-Modified-Since請求頭向服務器詢問:這個副本從上次查看後是否被修改過。
HTTP 1.1引入了另外一種校驗器:ETag。ETag是由服務器生成的唯一標識符,每當副本變化時它也會隨之變化。由於服務器控制着ETag如何生成,當緩存發出If-None-Match請求且ETag匹配時,就可以確信當前副本和源服務器上的內容完全一致。
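下面是一次條件請求交互的示意(假設副本帶有上文示例中的ETag,具體取值僅作示例):
GET /index.html HTTP/1.1
Host: www.example.com
If-None-Match: "3e86-410-3596fbbc"
如果副本沒有變化,源服務器只需返回一個不帶內容體的響應:
HTTP/1.1 304 Not Modified
ETag: "3e86-410-3596fbbc"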
幾乎所有的緩存都使用Last-Modified時間來確定副本是否夠新,而ETag校驗也正變得越來越流行;
大部分現代的Web服務器都會對靜態內容(如:文件)自動生成ETag和Last-Modified頭信息作為校驗器,你不必做任何設置。但是,服務器對於動態內容(例如:CGI,ASP或數據庫生成的網站)並不知道如何生成這些信息,請參考一下編寫利于緩存的腳本章節;
創建利于緩存網站的竅門
除了使用新鮮度信息和校驗,你還有很多方法使你的網站緩存友好。
- 保持URL穩定 : 這是緩存的金科玉律,如果你在不同的頁面上、向不同的用戶或者從不同的站點提供相同的內容,應該使用相同的URL,這是使你的網站緩存友好最簡單也是最高效的方法。例如:如果你在頁面上使用 "/index.html" 做為引用,那麼就一直用這個地址;
- 使用一個共用的庫 存放每頁都引用的圖片和其他頁面元素;
- 對于不經常改變的圖片/頁面啟用緩存 ,并使用Cache-Control: max-age屬性設置一個較長的過期時間;
- 對于定期更新的內容 設置一個緩存服務器可識別的max-age屬性或過期時間;
- 如果數據源(特別是下載文件)變更,修改名稱 ,這樣:你可以讓其很長時間不過期,并且保證服務的是正確的版本;而鏈接到下載文件的頁面是一個需要設置較短過期時間的頁面。
- 萬不得已不要改變文件 ,否則你會提供一個非常新的Last-Modified日期;例如:當你更新了網站,不要復制整個網站的所有文件,只上傳你修改的文件。
- 只在必要的時候使用Cookie ,cookie是非常難被緩存的,而且在大多數情況下是不必要的,如果使用cookie,控制在動態網頁上;
- 減少使用SSL ,加密的頁面不會被任何共享緩存服務器緩存,只在必要的時候使用,並且在SSL頁面上減少圖片的使用;
- 使用可緩存性評估引擎 ,這對于你實踐本文的很多概念都很有幫助;
編寫利于緩存的腳本
腳本缺省不會返回校驗器(Last-Modified或ETag頭信息)或其他新鮮度信息(Expires或Cache-Control)。有些腳本的確是動態內容(每次響應內容都不一樣),但是更多的網站(如搜索引擎和數據庫驅動的網站)還是能從緩存友好中獲益的。
一般說來,如果腳本生成的輸出在未來一段時間內(無論是幾分鐘還是幾天)用同樣的請求都能重複得到,那麼它就應該是可緩存的。如果腳本的輸出內容只隨URL變化而變化,也是可緩存的;但如果輸出會根據cookie、認證信息或者其他外部條件變化,則多半不可緩存。
- 讓腳本利于緩存(同時性能也更好)的最佳辦法,就是在內容改變時把它導出成一個靜態文件。Web服務器可以把它當作普通網頁來處理,自動生成並使用校驗器,這會讓一切都變得更簡單。記住只寫入確實發生了變化的文件,這樣Last-Modified時間才能得以保留;
- 另外一個讓腳本可緩存的方法是對一段時間內能保持較新的內容設置一個相對壽命的頭信息,雖然通過Expires頭信息也可以實現,但更容易的是用Cache-Control: max-age屬性,它會讓首次請求后一段時間內緩存保持新鮮;
- 如果以上做法你都做不到,你就需要讓腳本生成一個校驗器,並對 If-Modified-Since 和/或 If-None-Match 請求作出反應。這可以通過解析HTTP頭信息,並對符合條件的請求返回 304 Not Modified(內容未改變)來實現;可惜的是,這並不是一件輕而易舉的事(參見下面的示例);
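下面是一個極簡的PHP示意,演示上面第三種做法:用內容的md5值作為ETag,並對匹配的If-None-Match請求返回304(內容和取值僅作示例,並非唯一或權威的實現):
<?php
// 假設這是腳本生成的內容(僅作示例)
$content = "<html><body>Hello, world.</body></html>";
$etag    = '"' . md5($content) . '"';

Header("ETag: " . $etag);
Header("Cache-Control: max-age=600");

// 客戶端發送的If-None-Match會出現在$_SERVER['HTTP_IF_NONE_MATCH']中
if (isset($_SERVER['HTTP_IF_NONE_MATCH']) &&
    trim($_SERVER['HTTP_IF_NONE_MATCH']) == $etag) {
    Header("HTTP/1.1 304 Not Modified");
    exit;
}
print $content;
?>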
其他竅門:
- 盡量避免使用POST,除非萬不得已。POST方式的響應不會被大部分緩存保存;而如果你通過URL路徑和查詢串(即GET方式)傳遞信息,緩存就可以把這些響應保存下來供以後使用;
- 不要在URL中加入針對每個用戶的識別信息:除非內容是針對每個用戶不同的;
- 不要指望同一個用戶的所有請求都來自同一個主機地址,因為緩存常常是協同工作的;
- 如果方便的話,生成並返回Content-Length頭信息。這個屬性可以讓你的腳本響應被用於持久連接:客戶端可以通過一個TCP/IP連接請求多個副本,而不是為每次請求單獨建立連接,這樣你的網站響應會快很多;
常見問題解答
讓網站變得可緩存的要點是什么?
好的策略是找出哪些副本最熱門、體積最大(特別是圖片),優先針對它們開展工作。
如何讓頁面通過緩存達到最快響應?
最利于緩存的副本是那些設置了較長新鮮期的內容;校驗雖然有助於加快響應,但緩存仍不得不和源服務器聯系一次去檢查內容是否夠新;而如果緩存已經知道內容是新鮮的,內容就可以直接返回了。
我理解緩存是好的,但是我不得不統計多少人訪問了我的網站!
如果你必須統計每一次頁面訪問,可以選擇頁面上的【一】個小元素(或者頁面本身),通過適當的頭信息讓它不可緩存。例如:可以在每個頁面上引用一個1x1像素的不可緩存的透明圖片,請求它時的Referer頭信息會包含引用它的頁面的信息;
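例如,若用Apache的mod_headers,可以大致這樣讓這個統計圖片不可緩存(文件名僅為假設,僅作示意):
<Files "counter.gif">
Header set Cache-Control "no-cache, no-store, must-revalidate"
</Files>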
明確一點:這個并不會給你一個關于你用戶精確度很高的統計,而且這對互聯網和你的用戶這都不太好,消耗了額外的帶寬,強迫用戶去訪問無法緩存的內容。了解更多信息,參考訪問統計資料。
我如何能看到HTTP頭信息的內容?
很多瀏覽器可以在“頁面信息”或類似界面中讓你看到Expires和Last-Modified頭信息;如果有這樣的界面,你會看到一個關於頁面及其相關副本(如圖片)的菜單,以及它們的詳細信息;
看到完整的頭信息,你可以用telnet手工連接到Web服務器;
為此:你可能需要用一個字段指定端口(缺省是80),或者鏈接到www.example.com:80 或者 www.example.com 80(注意是空格),更多設置請參考一下telnet客戶端的文檔;
與網站建立連接後,輸入對副本的請求。例如,如果你想查看 http://www.example.com/foo.html 的頭信息,連接到 www.example.com 的80端口後,鍵入:
GET /foo.html HTTP/1.1 [回車]
Host: www.example.com [回車][回車]
在[回車]處按鍵盤的回車鍵;在最后,要按2次回車,然后,就會輸出頭信息及完整頁面,如果只想看頭信息,將GET換成HEAD。
我的頁面是密碼保護的,代理緩存服務器如何處理他們?
缺省的,網頁被HTTP認證保護的都是私密內容,它們不會被任何共享緩存保留。但是,你可以通過設置Cache-Control: public讓認證頁面可緩存,HTTP 1.1標準兼容的緩存服務器會認出它們可緩存。
如果你希望這些頁面可以被緩存,但又需要每個用戶認證後才能看,可以組合使用Cache-Control: public和no-cache頭信息,它告訴緩存:必須在提供緩存副本之前,先把新客戶的認證信息提交給源服務器。設置就是這樣:
Cache-Control: public, no-cache
不管是否這樣做,最好都儘量減少認證的使用範圍;例如:如果你的圖片不是機密的,把它們放在另外一個目錄,並配置服務器對這個目錄不強制認證,這樣那些圖片就自然可以被緩存了。
我們是否要擔心用戶通過cache訪問我的站點?
代理緩存不會緩存(也不會解密)SSL頁面,所以你不必為此擔心。但是,由於緩存保存了非SSL的請求以及從中抓取到的URL,你應該對沒有安全保護的網站保持警惕:不道德的緩存管理員有可能搜集到用戶的信息,特別是URL中的信息。
實際上,位于服務器和客戶端之間的管理員可以搜集這類信息。特別是通過CGI腳本在通過URL傳遞用戶名和密碼的時候會有很大問題;這對泄露用戶名和密碼是一個很大的漏洞;
如果你對互聯網的安全機制有基本的了解,就不必對代理緩存有任何額外的擔心。
我在尋找一個集成的Web發布解決方案,哪些系統是對緩存友好的?
這很難說。一般說來,系統越復雜越難緩存。最差的是那種全部內容動態生成並且不提供任何校驗器的系統;它們可能完全無法緩存。可以向系統提供商的技術人員了解一下,並參考後面的實現說明。
我的圖片設置了1個月后過期,但是我現在需要現在更新。
過期時間是繞不過去的,除非緩存(瀏覽器或者代理服務器)空間不足才會刪除副本,緩存副本在過期之間會被一直使用。
最好的辦法是改變它們的鏈接,這樣,新的副本將會從源服務器上重新下載。記住:引用它們的頁面本身也會被緩存。因此,使用靜態圖片和類似內容是很容易緩存的,而引用他們的HTML頁面則要保持非常更新;
如果你希望對指定的緩存服務器重新載入一個副本,你可以強制使用“刷新”(在FireFox中在reload的時候按住shift鍵:就會有前面提到惡Pragma: no-cache頭信息發出)。或者你可以讓緩存的管理員從他們的界面中刪除相應內容;
我運行一個Web托管服務,如何讓我的用戶發布緩存友好的網頁?
如果你使用Apache,可以考慮允許他們使用.htaccess文件,並提供相應的文檔;
另外一方面:你也可以考慮在各個虛擬主機上預先建立幾種不同緩存策略的區域。例如:你可以設置一個目錄 /cache-1m 專門用於存放訪問後可緩存1個月的內容,另外設置一個 /no-cache 目錄,其中的內容在返回時都帶有禁止緩存的頭信息。
無論你能做到哪些,最好先從訪問量最大的客戶入手推進緩存。大部分的節約(帶寬和服務器負載)都來自高流量的站點;
我標記了一些網頁是可緩存的,但是瀏覽器仍然每次都向服務器發送請求。如何強制它們保存副本?
緩存並沒有義務保存副本並重用它;它們只是被要求在某些條件下不得保存或使用副本。所有的緩存都會基於文件的大小、類型(例如:圖片或頁面)或者剩餘的存儲空間來決定保留哪些副本。和更熱門或者更大的文件相比,你的頁面可能被認為不值得保留。
有些緩存服務器允許管理員根據文件類型確定緩存副本的優先級,也有些允許把某些副本“固定”在緩存裡,讓它們始終可用;
緩存機制的實現 - Web服務器端配置
一般說來,應該選擇最新版本的Web服務器程序來部署。不僅因為它們包含更多利于緩存的功能,新版本往往在性能和安全性方面都有很多的改善。
Apache HTTP服務器
Apache通過可選的模塊來設置這些頭信息,包括Expires和Cache-Control。這些模塊在1.2及以上版本中都可用;
這些模塊需要和Apache一起編譯;雖然它們已經包含在發布版本中,但缺省並沒有啟用。為了確定相應模塊已經被啟用:找到httpd程序並運行 httpd -l,它會列出可用的模塊,我們需要的模塊是mod_expires和mod_headers。
- 如果這些模塊不可用,而你有管理權限,你可以重新編譯Apache並包含它們。可以通過取消配置文件(Configuration)中相應行的註釋,或者在編譯配置的時候增加 -enable-module=expires 和 -enable-module=headers 選項(apache 1.3及以上版本)。參考Apache發布版中的INSTALL文件;
Apache一旦啟用了相應的模塊,你就可以在.htaccess文件或者在服務器的access.conf文件中通過mod_expires設置副本什麼時候過期。你可以設置過期時間從訪問時間或文件修改時間開始計算,並且應用到某種文件類型上或作為缺省設置,參考
模塊的文檔
獲得更多信息,或者遇到問題的時候向你身邊的apache專家討教。
應用Cache-Control頭信息,你需要使用mod_headers,它將允許你設置任意的HTTP頭信息,參考
mod_headers的文檔
可以獲得更多資料;
這里有個例子說明如何使用頭信息:
- .htaccess文件允許Web發布者使用通常只能在服務器配置文件中使用的命令。它影響到所在目錄及其子目錄;問一下你的服務器管理員,確認這個功能是否啟用了。
ExpiresActive On
### 設置 .gif 在被訪問過后1個月過期。
ExpiresByType image/gif A2592000
### 其他文件設置為最后修改時間1天后過期
### (用了另外的語法)
ExpiresDefault "modification plus 1 day"
### 在index.html文件應用 Cache-Control頭屬性
<Files index.html>
Header append Cache-Control "public, must-revalidate"
</Files>
- 注意: 在適當情況下mod_expires會自動計算并插入Cache-Control:max-age 頭信息
Apache 2.0的配置和1.3類似,更多信息可以參考2.0的
mod_expires
和
mod_headers文檔
;
Microsoft IIS服務器
Microsoft的IIS可以非常容易的設置頭信息,注意:這只針對IIS 4.0服務器,并且只能在NT服務器上運行。
為網站的一個區域設置頭信息,先要在管理工具界面中選中它,然後打開它的屬性。選擇HTTP Headers選項卡,你會看到2個值得關注的區域:啟用內容過期(Enable Content Expiration)和自定義HTTP頭信息(Custom HTTP headers)。第一個選項的含義不言自明,第二個可以用於設置Cache-Control頭信息;
設置asp頁面的頭信息可以參考后面的ASP章節,也可以通過ISAPI模塊設置頭信息,細節請參考MSDN。
Netscape/iPlanet企業服務器
3.6版本以后,Netscape/iPlanet已經不能設置Expires頭信息了,他從3.0版本開始支持HTTP 1.1的功能。這意味著HTTP 1.1的緩存(代理服務器/瀏覽器)優勢都可以通過你對Cache-Control設置來獲得。
使用Cache-Control頭信息,在管理服務器上選擇內容管理|緩存設置目錄。然后:使用資源選擇器,選擇你希望設置頭信息的目錄。設置完頭信息后,點擊“OK”。更多信息請參考
Netscape/iPlanet企業服務器的手冊
。
緩存機制的實現:服務器端腳本
需要注意的一點是:也許服務器設置HTTP頭信息比腳本語言更容易,但是兩者你都應該使用。
因為服務器端的腳本主要是為了動態內容,他本身不產生可緩存的文件頁面,即使內容實際是可以緩存的。如果你的內容經常改變,但是不是每次頁面請求都改變, 考慮設置一個Cache-Control: max-age頭信息;大部分用戶會在短時間內多次訪問同一頁面。例如: 用戶點擊“后退”按鈕,即使沒有新內容,他們仍然要再次從服務器下載內容查看。
CGI程序
CGI腳本是生成內容最流行的方式之一。你可以很容易地在發送內容主體之前輸出HTTP響應頭信息;大部分CGI實現本來就需要你這樣輸出Content-Type頭信息,例如這個Perl腳本:
#!/usr/bin/perl
print "Content-type: text/html\n";
print "Expires: Thu, 29 Oct 1998 17:04:19 GMT\n";
print "\n";
### 后面是內容體...
由于都是文本,你可以很容易通過內置函數生成Expires和其他日期相關的頭信息。如果你使用Cache-Control: max-age;會更簡單;
這樣腳本可以在被請求后緩存10分鐘;這樣用戶如果按“后退”按鈕,他們不會重新提交請求;
CGI的規范同時也允許客戶端發送頭信息,每個頭信息都有一個‘HTTP_’的前綴;這樣如果一個客戶端發送一個If-Modified-Since請求,就是這樣的:
參考一下
cgi_buffer
庫,一個自動處理ETag的生成和校驗的庫,生成Content-Length屬性和對內容進行gzip壓縮。在Python腳本中也只需加入一行;
服務器端包含 Server Side Includes
SSI(經常使用.shtml擴展名)是網站發布者最早能用來生成動態內容的方案之一。通過在頁面中使用特殊的標記,形成了一種有限的嵌入HTML的腳本方式;
大部分SSI的實現不會設置校驗器,因此無法被緩存。但是Apache的實現允許用戶指定哪些SSI文件可以被緩存:對相應的文件設置組執行權限,並結合 XbitHack full 指令。更多信息請參考
mod_include文檔
。
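大致的做法如下(文件名僅為示例,具體細節請以mod_include文檔為準):
### 在服務器配置中啟用:
XbitHack full
### 然後對允許緩存的SSI文件設置組執行位:
chmod g+x page.html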
PHP
PHP是一個內建在web服務器中的服務器端腳本語言,當做為HTML嵌入式腳本,很像SSI,但是有更多的選項,PHP可以在各種Web服務器上設置為CGI模式運行,或者做為Apache的模塊;
缺省PHP生成副本沒有設置校驗器,于是也無法緩存,但是開發者可以通過Header()函數來生成HTTP的頭信息;
例如:以下代碼會生成一個Cache-Control頭信息,并設置為3天以后過期的Expires頭信息;
Header("Cache-Control: must-revalidate");
$offset = 60 * 60 * 24 * 3;
$ExpStr = "Expires: " . gmdate("D, d M Y H:i:s", time() + $offset) . " GMT";
Header($ExpStr);
?>
記住: Header()的輸出必須先于所有其他HTML的輸出;
正如你看到的:你可以手工創建HTTP日期;PHP沒有為你提供專門的函數(新版本已經讓這個越來越容易了,請參考PHP的
日期相關函數文檔
),當然,最簡單的還是設置Cache-Control: max-age頭信息,而且對于大部分情況都比較適用;
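例如,讓副本在請求後的10分鐘內保持新鮮(時間僅作示例):
Header("Cache-Control: max-age=600");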
更多信息,請參考
header相關的文檔
;
也請參考一下
cgi_buffer
庫,自動處理ETag的生成和校驗,Content-Length生成和內容的gzip壓縮,PHP腳本只需包含1行代碼;
Cold Fusion
Cold Fusion
是Macromedia的商業服務器端腳本引擎,支持Windows、Linux和多種Unix平臺上的多種Web服務器。Cold Fusion通過CFHEADER標記設置HTTP頭信息相對容易。可惜的是,像下面這樣設置Expires頭信息的例子有些誤導:
<CFHEADER NAME="Expires" VALUE="#Now()#">
它並不像你想像的那樣工作,因為這個時間(本例中為請求發起的時間)並不會被轉換成符合HTTP規範的日期格式,而只是被打印成Cold Fusion的日期/時間對象的字符串表示;大部分客戶端會忽略這樣的值,或者把它轉換成缺省值,比如1970年1月1日。
不過,Cold Fusion提供了一個可以完成這項工作的日期格式化函數:GetHttpTimeString。結合DateAdd函數,就很容易設置過期時間了;下面我們設置一個頭信息,聲明副本在1個月以後過期:
<cfheader name="Expires" value="#GetHttpTimeString(DateAdd('m', 1, Now()))#">
你也可以使用CFHEADER標簽來設置Cache-Control: max-age等其他頭信息;
記住:在Cold Fusion的某些部署方式下(例如以CGI方式運行時),Web服務器設置的頭信息會被透傳。檢查你的部署方式,確定是否可以改在服務器上設置頭信息,而不必在Cold Fusion中設置。
ASP和ASP.NET
在ASP中設置HTTP頭信息時,要確保對Response對象的相關調用先於任何HTML內容的輸出,或者使用Response.Buffer暫存輸出;同樣要注意:某些版本的IIS缺省會給ASP頁面輸出Cache-Control: private頭信息,必須把它聲明成public才能被共享緩存服務器緩存。
IIS內置的ASP(其他Web服務器上也可用)同樣允許你設置HTTP頭信息。例如:設置過期時間,你可以設置Response對象的屬性:
<% Response.Expires=1440 %>
表示請求的副本在輸出後指定的分鐘數後過期;類似的,也可以設置絕對的過期時間(確認你的HTTP日期格式正確);
Cache-Control頭信息可以這樣設置:
<% Response.CacheControl="public" %>
在ASP.NET中,Response.Expires 已經不推薦使用了,正確的方法是通過Response.Cache設置Cache相關的頭信息;
Response.Cache.SetExpires ( DateTime.Now.AddMinutes ( 60 ) ) ;
Response.Cache.SetCacheability ( HttpCacheability.Public ) ;
參考
MSDN文檔
可以找到更多相關信息;
參考文檔和深入閱讀
HTTP 1.1 規范定義
HTTP 1.1的規范有大量的擴展用于頁面緩存,以及權威的接口實現指南,參考章節:13, 14.9, 14.21, 以及 14.25.
Web-Caching.com
關于非連續性訪問統計
Jeff Goldberg內容豐富的演說告訴你為什么不應該過度依賴訪問統計和計數器;
可緩存性檢測引擎
可緩存的引擎設計,檢測網頁并確定其如何與Web緩存服務器交互, 這個引擎配合這篇指南是一個很好的調試工具,
cgi_buffer庫
包含庫:用于CGI模式運行的Perl/Python/PHP腳本,自動處理ETag生成/校驗,Content-Length生成和內容壓縮。正確地。 Python版本也被用作其他大量的CGI腳本。
關于本文檔
本文版權屬于Mark Nottingham <
mnot@pobox.com
>,本作品遵循
創作共用版權
。
如果你鏡像本文,請通過以上郵件告知,這樣你可以在更新時被通知;
所有的商標屬于其所有人。
雖然作者確信內容在發布時的正確性,但不保證其應用或引申應用的正確性,如有誤傳,錯誤或其他需要澄清的問題請盡快告知作者;
本文最新版本可以從
http://www.mnot.net/cache_docs/
獲得;
翻譯版本包括:捷克語版、法語版和中文版。
版本: 1.81 - 2007年3月16日
創作共用版權聲明
翻譯:
車東
2007年9月6日
Caching Tutorial
for Web Authors and Webmasters
- What’s a Web Cache? Why do people use them?
- Kinds of Web Caches
- Aren’t Web Caches bad for me? Why should I help them?
- How Web Caches Work
- How (and how not) to Control Caches
- Tips for Building a Cache-Aware Site
- Writing Cache-Aware Scripts
- Frequently Asked Questions
- Implementation Notes — Web Servers
- Implementation Notes — Server-Side Scripting
- References and Further Information
- About This Document
What’s a Web Cache? Why do people use them?
A Web cache sits between one or more Web servers (also known as origin servers ) and a client or many clients, and watches requests come by, saving copies of the responses — like HTML pages, images and files (collectively known as representations ) — for itself. Then, if there is another request for the same URL, it can use the response that it has, instead of asking the origin server for it again.
There are two main reasons that Web caches are used:
- To reduce latency — Because the request is satisfied from the cache (which is closer to the client) instead of the origin server, it takes less time for it to get the representation and display it. This makes the Web seem more responsive.
- To reduce network traffic — Because representations are reused, it reduces the amount of bandwidth used by a client. This saves money if the client is paying for traffic, and keeps their bandwidth requirements lower and more manageable.
Kinds of Web Caches
Browser Caches
If you examine the preferences dialog of any modern Web browser (like Internet Explorer, Safari or Mozilla), you’ll probably notice a “cache” setting. This lets you set aside a section of your computer’s hard disk to store representations that you’ve seen, just for you. The browser cache works according to fairly simple rules. It will check to make sure that the representations are fresh, usually once a session (that is, the once in the current invocation of the browser).
This cache is especially useful when users hit the “back” button or click a link to see a page they’ve just looked at. Also, if you use the same navigation images throughout your site, they’ll be served from browsers’ caches almost instantaneously.
Proxy Caches
Web proxy caches work on the same principle, but a much larger scale. Proxies serve hundreds or thousands of users in the same way; large corporations and ISPs often set them up on their firewalls, or as standalone devices (also known as intermediaries ).
Because proxy caches aren’t part of the client or the origin server, but instead are out on the network, requests have to be routed to them somehow. One way to do this is to use your browser’s proxy setting to manually tell it what proxy to use; another is using interception. Interception proxies have Web requests redirected to them by the underlying network itself, so that clients don’t need to be configured for them, or even know about them.
Proxy caches are a type of shared cache ; rather than just having one person using them, they usually have a large number of users, and because of this they are very good at reducing latency and network traffic. That’s because popular representations are reused a number of times.
Gateway Caches
Also known as “reverse proxy caches” or “surrogate caches,” gateway caches are also intermediaries, but instead of being deployed by network administrators to save bandwidth, they’re typically deployed by Webmasters themselves, to make their sites more scalable, reliable and better performing.
Requests can be routed to gateway caches by a number of methods, but typically some form of load balancer is used to make one or more of them look like the origin server to clients.
Content delivery networks (CDNs) distribute gateway caches throughout the Internet (or a part of it) and sell caching to interested Web sites. Speedera and Akamai are examples of CDNs.
This tutorial focuses mostly on browser and proxy caches, although some of the information is suitable for those interested in gateway caches as well.
Aren’t Web Caches bad for me? Why should I help them?
Web caching is one of the most misunderstood technologies on the Internet. Webmasters in particular fear losing control of their site, because a proxy cache can “hide” their users from them, making it difficult to see who’s using the site.
Unfortunately for them, even if Web caches didn’t exist, there are too many variables on the Internet to assure that they’ll be able to get an accurate picture of how users see their site. If this is a big concern for you, this tutorial will teach you how to get the statistics you need without making your site cache-unfriendly.
Another concern is that caches can serve content that is out of date, or stale . However, this tutorial can show you how to configure your server to control how your content is cached.
CDNs are an interesting development, because unlike many proxy caches, their gateway caches are aligned with the interests of the Web site being cached, so that these problems aren’t seen. However, even when you use a CDN, you still have to consider that there will be proxy and browser caches downstream.
On the other hand, if you plan your site well, caches can help your Web site load faster, and save load on your server and Internet link. The difference can be dramatic; a site that is difficult to cache may take several seconds to load, while one that takes advantage of caching can seem instantaneous in comparison. Users will appreciate a fast-loading site, and will visit more often.
Think of it this way; many large Internet companies are spending millions of dollars setting up farms of servers around the world to replicate their content, in order to make it as fast to access as possible for their users. Caches do the same for you, and they’re even closer to the end user. Best of all, you don’t have to pay for them.
The fact is that proxy and browser caches will be used whether you like it or not. If you don’t configure your site to be cached correctly, it will be cached using whatever defaults the cache’s administrator decides upon.
How Web Caches Work
All caches have a set of rules that they use to determine when to serve a representation from the cache, if it’s available. Some of these rules are set in the protocols (HTTP 1.0 and 1.1), and some are set by the administrator of the cache (either the user of the browser cache, or the proxy administrator).
Generally speaking, these are the most common rules that are followed (don’t worry if you don’t understand the details, it will be explained below):
- If the response’s headers tell the cache not to keep it, it won’t.
- If the request is authenticated or secure (i.e., HTTPS), it won’t be cached.
-
A cached representation is considered
fresh
(that is, able to be sent to a client without checking with the origin server) if:
- It has an expiry time or other age-controlling header set, and is still within the fresh period, or
- If the cache has seen the representation recently, and it was modified relatively long ago.
- If a representation is stale, the origin server will be asked to validate it, or tell the cache whether the copy that it has is still good.
- Under certain circumstances — for example, when it’s disconnected from a network — a cache can serve stale responses without checking with the origin server.
If no validator (an
ETag
or
Last-Modified
header) is present on a response,
and
it doesn't have any explicit freshness information, it will usually — but not always — be considered uncacheable.
Together, freshness and validation are the most important ways that a cache works with content. A fresh representation will be available instantly from the cache, while a validated representation will avoid sending the entire representation over again if it hasn’t changed.
How (and how not) to Control Caches
There are several tools that Web designers and Webmasters can use to fine-tune how caches will treat their sites. It may require getting your hands a little dirty with your server’s configuration, but the results are worth it. For details on how to use these tools with your server, see the Implementation sections below.
HTML Meta Tags and HTTP Headers
HTML authors can put tags in a document’s <HEAD> section that describe its attributes. These meta tags are often used in the belief that they can mark a document as uncacheable, or expire it at a certain time.
Meta tags are easy to use, but aren’t very effective. That’s because they’re only honored by a few browser caches (which actually read the HTML), not proxy caches (which almost never read the HTML in the document). While it may be tempting to put a Pragma: no-cache meta tag into a Web page, it won’t necessarily cause it to be kept fresh.
If your site is hosted at an ISP or hosting farm and they don’t give you the ability to set arbitrary HTTP headers (like
Expires
and
Cache-Control
), complain loudly; these are tools necessary for doing your job.
On the other hand, true HTTP headers give you a lot of control over how both browser caches and proxies handle your representations. They can’t be seen in the HTML, and are usually automatically generated by the Web server. However, you can control them to some degree, depending on the server you use. In the following sections, you’ll see what HTTP headers are interesting, and how to apply them to your site.
HTTP headers are sent by the server before the HTML, and only seen by the browser and any intermediate caches. Typical HTTP 1.1 response headers might look like this:
HTTP/1.1 200 OK
Date: Fri, 30 Oct 1998 13:19:41 GMT
Server: Apache/1.3.3 (Unix)
Cache-Control: max-age=3600, must-revalidate
Expires: Fri, 30 Oct 1998 14:19:41 GMT
Last-Modified: Mon, 29 Jun 1998 02:28:12 GMT
ETag: "3e86-410-3596fbbc"
Content-Length: 1040
Content-Type: text/html
The HTML would follow these headers, separated by a blank line. See the Implementation sections for information about how to set HTTP headers.
Pragma HTTP Headers (and why they don’t work)
Many people believe that assigning a
Pragma: no-cache
HTTP header to a representation will make it uncacheable. This is not necessarily true; the HTTP specification does not set any guidelines for Pragma response headers; instead, Pragma request headers (the headers that a browser sends to a server) are discussed. Although a few caches may honor this header, the majority won’t, and it won’t have any effect. Use the headers below instead.
Controlling Freshness with the Expires HTTP Header
The
Expires
HTTP header is a basic means of controlling caches; it tells all caches how long the associated representation is fresh for. After that time, caches will always check back with the origin server to see if a document is changed.
Expires
headers are supported by practically every cache.
Most Web servers allow you to set
Expires
response headers in a number of ways. Commonly, they will allow setting an absolute time to expire, a time based on the last time that the client retrieved the representation (last
access time
), or a time based on the last time the document changed on your server (last
modification time
).
Expires
headers are especially good for making static images (like navigation bars and buttons) cacheable. Because they don’t change much, you can set extremely long expiry time on them, making your site appear much more responsive to your users. They’re also useful for controlling caching of a page that is regularly changed. For instance, if you update a news page once a day at 6am, you can set the representation to expire at that time, so caches will know when to get a fresh copy, without users having to hit ‘reload’.
The
only
value valid in an
Expires
header is a HTTP date; anything else will most likely be interpreted as ‘in the past’, so that the representation is uncacheable. Also, remember that the time in a HTTP date is Greenwich Mean Time (GMT), not local time.
For example:
Expires: Fri, 30 Oct 1998 14:19:41 GMT
It’s important to make sure that your Web server’s clock is accurate if you use the
Expires
header. One way to do this is using the
Network Time Protocol
(NTP); talk to your local system administrator to find out more.
Although the
Expires
header is useful, it has some limitations. First, because there’s a date involved, the clocks on the Web server and the cache must be synchronised; if they have a different idea of the time, the intended results won’t be achieved, and caches might wrongly consider stale content as fresh.
Another problem with
Expires
is that it’s easy to forget that you’ve set some content to expire at a particular time. If you don’t update an
Expires
time before it passes, each and every request will go back to your Web server, increasing load and latency.
Cache-Control HTTP Headers
HTTP 1.1 introduced a new class of headers,
Cache-Control
response headers, to give Web publishers more control over their content, and to address the limitations of
Expires
.
Useful
Cache-Control
response headers include:
-
max-age=
[seconds] — specifies the maximum amount of time that a representation will be considered fresh. Similar to Expires
, this directive is relative to the time of the request, rather than absolute. [seconds] is the number of seconds from the time of the request you wish the representation to be fresh for. -
s-maxage=
[seconds] — similar to max-age
, except that it only applies to shared (e.g., proxy) caches. -
public
— marks authenticated responses as cacheable; normally, if HTTP authentication is required, responses are automatically private. -
private
— allows caches that are specific to one user (e.g., in a browser) to store the response; shared caches (e.g., in a proxy) may not. -
no-cache
— forces caches to submit the request to the origin server for validation before releasing a cached copy, every time. This is useful to assure that authentication is respected (in combination with public), or to maintain rigid freshness, without sacrificing all of the benefits of caching. -
no-store
— instructs caches not to keep a copy of the representation under any conditions. -
must-revalidate
— tells caches that they must obey any freshness information you give them about a representation. HTTP allows caches to serve stale representations under special conditions; by specifying this header, you’re telling the cache that you want it to strictly follow your rules. -
proxy-revalidate
— similar to must-revalidate
, except that it only applies to proxy caches.
For example:
Cache-Control: max-age=3600, must-revalidate
If you plan to use the
Cache-Control
headers, you should have a look at the excellent documentation in HTTP 1.1; see
References and Further Information
.
Validators and Validation
In How Web Caches Work , we said that validation is used by servers and caches to communicate when a representation has changed. By using it, caches avoid having to download the entire representation when they already have a copy locally, but they’re not sure if it’s still fresh.
Validators are very important; if one isn’t present, and there isn’t any freshness information (
Expires
or
Cache-Control
) available, caches will not store a representation at all.
The most common validator is the time that the document last changed, as communicated in
Last-Modified
header. When a cache has a representation stored that includes a
Last-Modified
header, it can use it to ask the server if the representation has changed since the last time it was seen, with an
If-Modified-Since
request.
HTTP 1.1 introduced a new kind of validator called the
ETag
. ETags are unique identifiers that are generated by the server and changed every time the representation does. Because the server controls how the ETag is generated, caches can be surer that if the ETag matches when they make a
If-None-Match
request, the representation really is the same.
Almost all caches use Last-Modified times in determining if a representation is fresh; ETag validation is also becoming prevalent.
Most modern Web servers will generate both
ETag
and
Last-Modified
headers to use as validators for static content (i.e., files) automatically; you won’t have to do anything. However, they don’t know enough about dynamic content (like CGI, ASP or database sites) to generate them; see
Writing Cache-Aware Scripts
.
Tips for Building a Cache-Aware Site
Besides using freshness information and validation, there are a number of other things you can do to make your site more cache-friendly.
- Use URLs consistently — this is the golden rule of caching. If you serve the same content on different pages, to different users, or from different sites, it should use the same URL. This is the easiest and most effective way to make your site cache-friendly. For example, if you use “/index.html” in your HTML as a reference once, always use it that way.
- Use a common library of images and other elements and refer back to them from different places.
-
Make caches store images and pages that don’t change often
by using a
Cache-Control: max-age
header with a large value. - Make caches recognise regularly updated pages by specifying an appropriate max-age or expiration time.
- If a resource (especially a downloadable file) changes, change its name. That way, you can make it expire far in the future, and still guarantee that the correct version is served; the page that links to it is the only one that will need a short expiry time.
-
Don’t change files unnecessarily.
If you do, everything will have a falsely young
Last-Modified
date. For instance, when updating your site, don’t copy over the entire site; just move the files that you’ve changed. - Use cookies only where necessary — cookies are difficult to cache, and aren’t needed in most situations. If you must use a cookie, limit its use to dynamic pages.
- Minimize use of SSL — because encrypted pages are not stored by shared caches, use them only when you have to, and use images on SSL pages sparingly.
- Check your pages with REDbot — it can help you apply many of the concepts in this tutorial.
Writing Cache-Aware Scripts
By default, most scripts won’t return a validator (a
Last-Modified
or
ETag
response header) or freshness information (
Expires
or
Cache-Control
). While some scripts really are dynamic (meaning that they return a different response for every request), many (like search engines and database-driven sites) can benefit from being cache-friendly.
Generally speaking, if a script produces output that is reproducible with the same request at a later time (whether it be minutes or days later), it should be cacheable. If the content of the script changes only depending on what’s in the URL, it is cacheable; if the output depends on a cookie, authentication information or other external criteria, it probably isn’t.
-
The best way to make a script cache-friendly (as well as perform better) is to dump its content to a plain file whenever it changes. The Web server can then treat it like any other Web page, generating and using validators, which makes your life easier. Remember to only write files that have changed, so the
Last-Modified
times are preserved. -
Another way to make a script cacheable in a limited fashion is to set an age-related header for as far in the future as practical. Although this can be done with
Expires
, it’s probably easiest to do so with Cache-Control: max-age
, which will make the request fresh for an amount of time after the request. -
If you can’t do that, you’ll need to make the script generate a validator, and then respond to
If-Modified-Since
and/or If-None-Match
requests. This can be done by parsing the HTTP headers, and then responding with 304 Not Modified
when appropriate. Unfortunately, this is not a trivial task.
Some other tips;
- Don’t use POST unless it’s appropriate. Responses to the POST method aren’t kept by most caches; if you send information in the path or query (via GET), caches can store that information for the future.
- Don’t embed user-specific information in the URL unless the content generated is completely unique to that user.
- Don’t count on all requests from a user coming from the same host , because caches often work together.
-
Generate
Content-Length
response headers. It’s easy to do, and it will allow the response of your script to be used in a persistent connection . This allows clients to request multiple representations on one TCP/IP connection, instead of setting up a connection for every request. It makes your site seem much faster.
See the Implementation Notes for more specific information.
Frequently Asked Questions
What are the most important things to make cacheable?
A good strategy is to identify the most popular, largest representations (especially images) and work with them first.
How can I make my pages as fast as possible with caches?
The most cacheable representation is one with a long freshness time set. Validation does help reduce the time that it takes to see a representation, but the cache still has to contact the origin server to see if it’s fresh. If the cache already knows it’s fresh, it will be served directly.
I understand that caching is good, but I need to keep statistics on how many people visit my page!
If you must know every time a page is accessed, select ONE small item on a page (or the page itself), and make it uncacheable, by giving it suitable headers. For example, you could refer to a 1x1 transparent uncacheable image from each page. The
Referer
header will contain information about what page called it.
Be aware that even this will not give truly accurate statistics about your users, and is unfriendly to the Internet and your users; it generates unnecessary traffic, and forces people to wait for that uncached item to be downloaded. For more information about this, see On Interpreting Access Statistics in the references .
How can I see a representation’s HTTP headers?
Many Web browsers let you see the
Expires
and
Last-Modified
headers in a “page info” or similar interface. If available, this will give you a menu of the page and any representations (like images) associated with it, along with their details.
To see the full headers of a representation, you can manually connect to the Web server using a Telnet client.
To do so, you may need to type the port (by default, 80) into a separate field, or you may need to connect to
www.example.com:80
or
www.example.com 80
(note the space). Consult your Telnet client’s documentation.
Once you’ve opened a connection to the site, type a request for the representation. For instance, if you want to see the headers for
http://www.example.com/foo.html
, connect to
www.example.com
, port
80
, and type:
GET /foo.html HTTP/1.1 [return]
Host: www.example.com [return][return]
Press the Return key every time you see
[return]
; make sure to press it twice at the end. This will print the headers, and then the full representation. To see the headers only, substitute HEAD for GET.
My pages are password-protected; how do proxy caches deal with them?
By default, pages protected with HTTP authentication are considered private; they will not be kept by shared caches. However, you can make authenticated pages public with a Cache-Control: public header; HTTP 1.1-compliant caches will then allow them to be cached.
If you’d like such pages to be cacheable, but still authenticated for every user, combine the
Cache-Control: public
and
no-cache
headers. This tells the cache that it must submit the new client’s authentication information to the origin server before releasing the representation from the cache. This would look like:
Cache-Control: public, no-cache
Whether or not this is done, it’s best to minimize use of authentication; for example, if your images are not sensitive, put them in a separate directory and configure your server not to force authentication for it. That way, those images will be naturally cacheable.
Should I worry about security if people access my site through a cache?
SSL pages are not cached (or decrypted) by proxy caches, so you don’t have to worry about that. However, because caches store non-SSL requests and URLs fetched through them, you should be conscious about unsecured sites; an unscrupulous administrator could conceivably gather information about their users, especially in the URL.
In fact, any administrator on the network between your server and your clients could gather this type of information. One particular problem is when CGI scripts put usernames and passwords in the URL itself; this makes it trivial for others to find and use their login.
If you’re aware of the issues surrounding Web security in general, you shouldn’t have any surprises from proxy caches.
I’m looking for an integrated Web publishing solution. Which ones are cache-aware?
It varies. Generally speaking, the more complex a solution is, the more difficult it is to cache. The worst are ones which dynamically generate all content and don’t provide validators; they may not be cacheable at all. Speak with your vendor’s technical staff for more information, and see the Implementation notes below.
My images expire a month from now, but I need to change them in the caches now!
The Expires header can’t be circumvented; unless the cache (either browser or proxy) runs out of room and has to delete the representations, the cached copy will be used until then.
The most effective solution is to change any links to them; that way, completely new representations will be loaded fresh from the origin server. Remember that the page that refers to a representation will be cached as well. Because of this, it’s best to make static images and similar representations very cacheable, while keeping the HTML pages that refer to them on a tight leash.
If you want to reload a representation from a specific cache, you can either force a reload (in Firefox, holding down shift while pressing ‘reload’ will do this by issuing a
Pragma: no-cache
request header) while using the cache. Or, you can have the cache administrator delete the representation through their interface.
I run a Web Hosting service. How can I let my users publish cache-friendly pages?
If you’re using Apache, consider allowing them to use .htaccess files and providing appropriate documentation.
Otherwise, you can establish predetermined areas for various caching attributes in each virtual server. For instance, you could specify a directory /cache-1m that will be cached for one month after access, and a /no-cache area that will be served with headers instructing caches not to store representations from it.
Whatever you are able to do, it is best to work with your largest customers first on caching. Most of the savings (in bandwidth and in load on your servers) will be realized from high-volume sites.
I’ve marked my pages as cacheable, but my browser keeps requesting them on every request. How do I force the cache to keep representations of them?
Caches aren’t required to keep a representation and reuse it; they’re only required to not keep or use them under some conditions. All caches make decisions about which representations to keep based upon their size, type (e.g., image vs. html), or by how much space they have left to keep local copies. Yours may not be considered worth keeping around, compared to more popular or larger representations.
Some caches do allow their administrators to prioritize what kinds of representations are kept, and some allow representations to be “pinned” in cache, so that they’re always available.
Implementation Notes — Web Servers
Generally speaking, it’s best to use the latest version of whatever Web server you’ve chosen to deploy. Not only will they likely contain more cache-friendly features, new versions also usually have important security and performance improvements.
Apache HTTP Server
Apache uses optional modules to include headers, including both Expires and Cache-Control. Both modules are available in the 1.2 or greater distribution.
The modules need to be built into Apache; although they are included in the distribution, they are not turned on by default. To find out if the modules are enabled in your server, find the httpd binary and run
httpd -l
; this should print a list of the available modules (note that this only lists compiled-in modules; on later versions of Apache, use
httpd -M
to include dynamically loaded modules as well). The modules we’re looking for are mod_expires and mod_headers.
-
If they aren’t available, and you have administrative access, you can recompile Apache to include them. This can be done either by uncommenting the appropriate lines in the Configuration file, or using the
-enable-module=expires
and-enable-module=headers
arguments to configure (1.3 or greater). Consult the INSTALL file found with the Apache distribution.
Once you have an Apache with the appropriate modules, you can use mod_expires to specify when representations should expire, either in .htaccess files or in the server’s access.conf file. You can specify expiry from either access or modification time, and apply it to a file type or as a default. See the module documentation for more information, and speak with your local Apache guru if you have trouble.
To apply
Cache-Control
headers, you’ll need to use the mod_headers module, which allows you to specify arbitrary HTTP headers for a resource. See
the mod_headers documentation
.
Here’s an example .htaccess file that demonstrates the use of some headers.
- .htaccess files allow web publishers to use commands normally only found in configuration files. They affect the content of the directory they’re in and their subdirectories. Talk to your server administrator to find out if they’re enabled.
### activate mod_expires
ExpiresActive On
### Expire .gif's 1 month from when they're accessed
ExpiresByType image/gif A2592000
### Expire everything else 1 day from when it's last modified
### (this uses the Alternative syntax)
ExpiresDefault "modification plus 1 day"
### Apply a Cache-Control header to index.html
<Files index.html>
Header append Cache-Control "public, must-revalidate"
</Files>
-
Note that mod_expires automatically calculates and inserts a
Cache-Control:max-age
header as appropriate.
Apache 2’s configuration is very similar to that of 1.3; see the 2.2 mod_expires and mod_headers documentation for more information.
Microsoft IIS
Microsoft’s Internet Information Server makes it very easy to set headers in a somewhat flexible way. Note that this is only possible in version 4 of the server, which will run only on NT Server.
To specify headers for an area of a site, select it in the
Administration Tools
interface, and bring up its properties. After selecting the
HTTP Headers
tab, you should see two interesting areas;
Enable Content Expiration
and
Custom HTTP headers
. The first should be self-explanatory, and the second can be used to apply Cache-Control headers.
See the ASP section below for information about setting headers in Active Server Pages. It is also possible to set headers from ISAPI modules; refer to MSDN for details.
Netscape/iPlanet Enterprise Server
As of version 3.6, Enterprise Server does not provide any obvious way to set Expires headers. However, it has supported HTTP 1.1 features since version 3.0. This means that HTTP 1.1 caches (proxy and browser) will be able to take advantage of Cache-Control settings you make.
To use Cache-Control headers, choose
Content Management | Cache Control Directives
in the administration server. Then, using the Resource Picker, choose the directory where you want to set the headers. After setting the headers, click ‘OK’. For more information, see the
NES manual
.
Implementation Notes — Server-Side Scripting
One thing to keep in mind is that it may be easier to set HTTP headers with your Web server rather than in the scripting language. Try both.
Because the emphasis in server-side scripting is on dynamic content, it doesn’t make for very cacheable pages, even when the content could be cached. If your content changes often, but not on every page hit, consider setting a Cache-Control: max-age header; most users access pages again in a relatively short period of time. For instance, when users hit the ‘back’ button, if there isn’t any validator or freshness information available, they’ll have to wait until the page is re-downloaded from the server to see it.
CGI
CGI scripts are one of the most popular ways to generate content. You can easily append HTTP response headers by adding them before you send the body; Most CGI implementations already require you to do this for the
Content-Type
header. For instance, in Perl;
#!/usr/bin/perl
print "Content-type: text/html\n";
print "Expires: Thu, 29 Oct 1998 17:04:19 GMT\n";
print "\n";
### the content body follows...
Since it’s all text, you can easily generate
Expires
and other date-related headers with in-built functions. It’s even easier if you use
Cache-Control: max-age
;
print "Cache-Control: max-age=600\n";
This will make the script cacheable for 10 minutes after the request, so that if the user hits the ‘back’ button, they won’t be resubmitting the request.
The CGI specification also makes request headers that the client sends available in the environment of the script; each header has ‘HTTP_’ prepended to its name. So, if a client makes an
If-Modified-Since
request, it will show up as
HTTP_IF_MODIFIED_SINCE
.
See also the
cgi_buffer
library, which automatically handles ETag generation and validation,
Content-Length
generation and gzip content-coding for Perl and Python CGI scripts with a one-line include. The Python version can also be used to wrap arbitrary CGI scripts with.
Server Side Includes
SSI (often used with the extension .shtml) is one of the first ways that Web publishers were able to get dynamic content into pages. By using special tags in the pages, a limited form of in-HTML scripting was available.
Most implementations of SSI do not set validators, and as such are not cacheable. However, Apache’s implementation does allow users to specify which SSI files can be cached, by setting the group execute permissions on the appropriate files, combined with the
XbitHack full
directive. For more information, see the
mod_include documentation
.
PHP
PHP is a server-side scripting language that, when built into the server, can be used to embed scripts inside a page’s HTML, much like SSI, but with a far larger number of options. PHP can be used as a CGI script on any Web server (Unix or Windows), or as an Apache module.
By default, representations processed by PHP are not assigned validators, and are therefore uncacheable. However, developers can set HTTP headers by using the
Header()
function.
For example, this will create a Cache-Control header, as well as an Expires header three days in the future:
<?php Header("Cache-Control: must-revalidate"); $offset = 60 * 60 * 24 * 3; $ExpStr = "Expires: " . gmdate("D, d M Y H:i:s", time() + $offset) . " GMT"; Header($ExpStr); ?>
Remember that the
Header()
function MUST come before any other output.
As you can see, you’ll have to create the HTTP date for an
Expires
header by hand; PHP doesn’t provide a function to do it for you (although recent versions have made it easier; see the
PHP's date documentation
). Of course, it’s easy to set a
Cache-Control: max-age header
, which is just as good for most situations.
For more information, see the manual entry for header .
See also the
cgi_buffer
library, which automatically handles
ETag
generation and validation,
Content-Length
generation and gzip content-coding for PHP scripts with a one-line include.
Cold Fusion
Cold Fusion, by Macromedia, is a commercial server-side scripting engine, with support for several Web servers on Windows, Linux and several flavors of Unix.
Cold Fusion makes setting arbitrary HTTP headers relatively easy, with the
CFHEADER
tag. Unfortunately, their example for setting an
Expires
header, as below, is a bit misleading.
<CFHEADER NAME="Expires" VALUE="#Now()#">
It doesn’t work like you might think, because the time (in this case, when the request is made) doesn’t get converted to a HTTP-valid date; instead, it just gets printed as a representation of Cold Fusion’s Date/Time object. Most clients will either ignore such a value, or convert it to a default, like January 1, 1970.
However, Cold Fusion does provide a date formatting function that will do the job;
GetHttpTimeString
. In combination with
DateAdd
, it’s easy to set Expires dates; here, we set a header to declare that representations of the page expire in one month;
<cfheader name="Expires" value="#GetHttpTimeString(DateAdd('m', 1, Now()))#">
You can also use the
CFHEADER
tag to set
Cache-Control: max-age
and other headers.
Remember that Web server headers are passed through in some deployments of Cold Fusion (such as CGI); check yours to determine whether you can use this to your advantage, by setting headers on the server instead of in Cold Fusion.
ASP and ASP.NET
When setting HTTP headers from ASPs, make sure you either place the Response method calls before any HTML generation, or use
Response.Buffer
to buffer the output. Also, note that some versions of IIS set a
Cache-Control: private
header on ASPs by default, and must be declared public to be cacheable by shared caches.
Active Server Pages, built into IIS and also available for other Web servers, also allows you to set HTTP headers. For instance, to set an expiry time, you can use the properties of the
Response
object;
<% Response.Expires=1440 %>
specifying the number of minutes from the request to expire the representation.
Cache-Control
headers can be added like this:
<% Response.CacheControl="public" %>
In ASP.NET,
Response.Expires
is deprecated; the proper way to set cache-related headers is with
Response.Cache
;
Response.Cache.SetExpires ( DateTime.Now.AddMinutes ( 60 ) ) ;
Response.Cache.SetCacheability ( HttpCacheability.Public ) ;
See the MSDN documentation for more information.
References and Further Information
HTTP 1.1 Specification
The HTTP 1.1 spec has many extensions for making pages cacheable, and is the authoritative guide to implementing the protocol. See sections 13, 14.9, 14.21, and 14.25.
Web-Caching.com
An excellent introduction to caching concepts, with links to other online resources.
On Interpreting Access Statistics
Jeff Goldberg’s informative rant on why you shouldn’t rely on access statistics and hit counters.
REDbot
Examines HTTP resources to determine how they will interact with Web caches, and generally how well they use the protocol.
cgi_buffer Library
One-line include in Perl CGI, Python CGI and PHP scripts automatically handles ETag generation and validation, Content-Length generation and gzip Content-Encoding — correctly. The Python version can also be used as a wrapper around arbitrary CGI scripts.
About This Document
This document is Copyright © 1998-2010 Mark Nottingham <mnot@pobox.com>. This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 Unported License.
All trademarks within are property of their respective holders.
Although the author believes the contents to be accurate at the time of publication, no liability is assumed for them, their application or any consequences thereof. If any misrepresentations, errors or other need for clarification is found, please contact the author immediately.
The latest revision of this document can always be obtained from http://www.mnot.net/cache_docs/
Translations are available in: Belarusian , Chinese , Czech , German , and French .
June 29, 2010
